WaterSIC: information-theoretically (near) optimal linear layer quantization
Abstract
This paper considers the problem of converting a given dense linear layer into a low-precision version. The tradeoff between description length and the discrepancy introduced at the output of the layer is analyzed from an information-theoretic (IT) perspective. It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed "WaterSIC", is proposed and shown to be within a rate gap of 0.255 bit of the IT limit, uniformly over all possible covariance matrices of input activations. WaterSIC's key innovation is allocating different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as "waterfilling". Applying WaterSIC to real LLMs establishes a new state of the art for rates in the range of 1 to 4 bits per entry.
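For readers unfamiliar with the waterfilling idea mentioned above, the following is a minimal sketch of the classical textbook reverse waterfilling rate allocation for independent Gaussian components under a total mean-squared-error budget. It is not the WaterSIC algorithm itself. The function name `reverse_waterfill`, its parameters, and the bisection search are illustrative assumptions, and the input variances are assumed to be strictly positive.

```python
import numpy as np


def reverse_waterfill(variances, total_distortion, iters=100):
    """Classical reverse waterfilling (illustrative sketch, not WaterSIC).

    Finds a water level theta such that sum_i min(theta, var_i) equals
    total_distortion, then assigns each component the rate
    0.5 * log2(var_i / min(theta, var_i)) bits.
    Assumes all variances are strictly positive.
    """
    variances = np.asarray(variances, dtype=float)
    if np.any(variances <= 0):
        raise ValueError("variances must be strictly positive")
    if not 0 < total_distortion <= variances.sum():
        raise ValueError("total_distortion must lie in (0, sum(variances)]")

    # Bisection on the water level: the total distortion is nondecreasing in theta.
    lo, hi = 0.0, variances.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, variances).sum() < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)

    # Components with variance below the water level get zero rate.
    distortions = np.minimum(theta, variances)
    rates = 0.5 * np.log2(variances / distortions)
    return theta, distortions, rates


if __name__ == "__main__":
    theta, distortions, rates = reverse_waterfill([4.0, 1.0, 0.25], 0.75)
    print("water level:", theta)
    print("per-component rates (bits):", rates)
```

For variances (4, 1, 0.25) and a total distortion of 0.75, the water level is 0.25 and the rates are 2, 1, and 0 bits: high-variance components receive more bits, and components at or below the water level receive none. This unequal allocation is the behavior that the per-column rate allocation described in the abstract is said to mimic.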