GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache
Soosung Kim ⋅ Minjae Park ⋅ Eui-Young Chung ⋅ Jaeyong Chung
Abstract
The deployment of Large Language Models (LLMs) with extended context windows is increasingly constrained by the linear growth of Key-Value (KV) cache memory. Vector Quantization (VQ) is a key enabler for pushing KV cache storage toward the sub-1-bit regime; in particular, Residual Quantization (RQ) supports this goal via progressive refinement, sequentially encoding residuals with small codebooks across stages. Yet most VQ methods still rely on standard $\ell_2$ $K$-means as the core codebook-learning primitive. We identify a subtle high-dimensional issue: Euclidean centroid averaging can induce centroid shrinkage, and under an $\ell_2$ objective this shrinkage reduces the influence of angular alignment in the distortion term. This coupling can make directional preservation harder to maintain, hindering KV cache vector quantization methods from pushing into the sub-1-bit regime. To mitigate this coupling, we propose Gain-Shape $K$-means (GSKM), a drop-in replacement for standard $K$-means that improves directional fidelity while matching, and in some regimes improving, $\ell_2$ distortion. We build Gain-Shape Residual Quantization (GSRQ) by incorporating a weighted extension of GSKM into an RQ pipeline. On LLaMA-3-8B, GSRQ yields substantial improvements over strong KV cache quantization baselines across bit rates. At 1 bit, our method improves the average accuracy across LongBench tasks from 11.34 to 32.26, a gain of 20.92 percentage points over VQLLM.
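To make the gain-shape idea concrete, the sketch below separates each vector into a gain (its $\ell_2$ norm) and a shape (its direction), clusters the shapes with a spherical variant of $K$-means that keeps centroids unit-norm, and quantizes the gains with a small scalar codebook. This is only an illustrative reading of the abstract, not the authors' GSKM/GSRQ implementation; the codebook sizes, the quantile-based gain quantizer, and the absence of weighting are assumptions made here for brevity.

```python
# Minimal, illustrative gain-shape K-means sketch (NOT the paper's GSKM):
# directions are clustered with spherical K-means so centroids stay on the
# unit sphere (avoiding the norm shrinkage of Euclidean averaging), and the
# magnitudes are quantized separately with a small scalar codebook.
import numpy as np

def gain_shape_kmeans(X, n_shape=256, n_gain=16, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    gains = np.linalg.norm(X, axis=1, keepdims=True)       # per-vector magnitude
    shapes = X / np.maximum(gains, 1e-12)                   # unit-norm direction

    # Spherical K-means: assign by cosine similarity, then re-normalize the
    # cluster means so the shape centroids remain unit-norm.
    C = shapes[rng.choice(len(shapes), size=n_shape, replace=False)]
    for _ in range(n_iter):
        assign = np.argmax(shapes @ C.T, axis=1)
        for k in range(n_shape):
            members = shapes[assign == k]
            if len(members) > 0:
                m = members.sum(axis=0)
                C[k] = m / np.maximum(np.linalg.norm(m), 1e-12)

    # Small scalar codebook for the gains (quantile levels, purely illustrative).
    levels = np.quantile(gains, np.linspace(0.0, 1.0, n_gain))
    gain_idx = np.argmin(np.abs(gains - levels[None, :]), axis=1)
    return C, levels, assign, gain_idx

def reconstruct(C, levels, assign, gain_idx):
    # Approximate each vector as (quantized gain) * (quantized shape).
    return levels[gain_idx][:, None] * C[assign]
```

In a residual-quantization pipeline, a quantizer of this kind would be applied stage by stage, with each stage encoding the residual `X - reconstruct(...)` left by the previous one.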