CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs
Abstract
In this paper, we present CAT-Q, Cost-efficient and Accurate Ternary Quantization, to compress LLMs. Unlike current state-of-the-art ternary quantization methods, which rely on data-intensive and costly quantization-aware training to mitigate severe performance degradation, CAT-Q employs a simple yet effective post-training quantization scheme and is therefore easily applicable to LLMs with diverse architectures and model sizes. CAT-Q has two key components, learnable modulation (LM) and softened ternarization (ST), which are coupled from an optimization perspective. LM leverages a composition of learnable factors to modulate the distribution of the high-precision weights and the ternary threshold, making them less sensitive to ternarization. ST further introduces a novel transition function that guides the ternarization process toward stable convergence. We show that CAT-Q can quantize pre-trained LLMs with 1.7B to 8B parameters into ternary models using merely 512 calibration samples, while achieving performance competitive with the seminal BitNet 1.58-bit v1 and v2 families (with 1.3B to 7B parameters) trained on 100B tokens, yielding roughly a 100,000x reduction in training tokens. Moreover, we show for the first time that CAT-Q can quantize even larger pre-trained LLMs, with 14B to 235B parameters, into leading ternary models within 8 to 60 hours on 8 A100-80GB GPUs. Code will be made publicly available.
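To make the core operation concrete, the sketch below shows a generic threshold-based ternarization step in PyTorch, in the spirit of classical ternary weight networks. It is an illustrative assumption rather than CAT-Q's LM/ST formulation (which is defined later in the paper); the function name `ternarize` and the 0.75 threshold heuristic are placeholders.

```python
import torch

def ternarize(w: torch.Tensor, delta_scale: float = 0.75) -> torch.Tensor:
    """Generic threshold-based ternarization, for illustration only.

    Weights whose magnitude falls below the threshold `delta` are zeroed;
    the remaining weights are mapped to {-alpha, +alpha}, where alpha is
    the mean magnitude of the kept weights.
    """
    delta = delta_scale * w.abs().mean()                        # ternary threshold (heuristic)
    mask = (w.abs() > delta).float()                            # which weights stay nonzero
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # per-tensor scale
    return alpha * torch.sign(w) * mask                         # values in {-alpha, 0, +alpha}

# Example: ternarize a random weight matrix
w = torch.randn(4096, 4096)
w_t = ternarize(w)
print(w_t.unique())  # three distinct values: -alpha, 0, +alpha
```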