WinQ: Accelerating Quantization-Aware Training of Large Language Models around Saddle Points
Dongyue Li ⋅ Zechun Liu ⋅ Kai Yi ⋅ Changsheng Zhao ⋅ Zhenshuo Zhang ⋅ Raghuraman Krishnamoorthi ⋅ Harshit Khaitan ⋅ Hongyang Zhang ⋅ Steven Li
Abstract
Quantization-aware training is widely used for language model quantization at sub-4-bit precision: full-precision weights are trained with gradients computed on the quantized model. The main bottleneck of this training approach is its slow convergence and plateauing test performance, which worsens at lower bit-widths. While this phenomenon has been observed in prior work, its precise cause has not been carefully studied. In this paper, we analyze the convergence by computing the Hessian spectrum of the model loss throughout quantization-aware training. We find the key reason to be that the model weights converge to flat regions near saddle points, with a large fraction of Hessian eigenvalues concentrated around zero and the magnitudes of both positive and negative eigenvalues decreasing over training. Moreover, convergence is slower at lower bit-widths, where the Hessian eigenvalue magnitudes are significantly smaller. Motivated by these findings, we propose WinQ, an approach that accelerates quantized training with minimal overhead. WinQ periodically performs linear weight interpolation between the full-precision and quantized weights and computes gradients on noise-injected weights. Both techniques effectively regularize the Hessian and accelerate training, resulting in an algorithm broadly applicable to quantization methods. Extensive experiments show that WinQ accelerates various quantized training methods by up to 4$\times$. Under the same training budget, WinQ improves state-of-the-art sub-4-bit quantization performance by up to 8.8% in relative terms. Additionally, WinQ remains consistently effective across 16 settings spanning different language models, quantization methods, and bit-widths.
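To make the two steps described in the abstract concrete, here is a minimal, hypothetical sketch of how they could fit into a generic straight-through-estimator QAT loop. The function names (`quantize`, `winq_step`) and hyperparameters (interpolation coefficient `alpha`, noise scale `sigma`, period `k`) are illustrative assumptions, not the paper's actual implementation or values.

```python
import torch


def quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Illustrative symmetric uniform quantizer (stand-in for any QAT quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale


def winq_step(w_fp: torch.Tensor, step: int, k: int = 100,
              alpha: float = 0.5, sigma: float = 1e-3) -> torch.Tensor:
    """One WinQ-style update of the full-precision weights (assumed sketch).

    Every k steps, linearly interpolate the full-precision weights toward their
    quantized counterpart; then inject small Gaussian noise so the forward and
    backward passes run on noise-perturbed weights.
    """
    with torch.no_grad():
        if step % k == 0:
            w_q = quantize(w_fp)
            w_fp.mul_(1 - alpha).add_(alpha * w_q)  # weight interpolation
    # Noise-injected weights used for gradient computation (STE applied as usual)
    return w_fp + sigma * torch.randn_like(w_fp)
```

In this sketch the returned noisy weights would be quantized and used in the forward pass, while the optimizer continues to update `w_fp`; the periodic interpolation and noise injection are the only additions to a standard QAT loop, which is consistent with the paper's claim of minimal overhead.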