GPTVQ: The Blessing of Dimensionality for LLM Quantization
Abstract
Large language models (LLMs) impose large DRAM footprint and memory bandwidth costs, severely limiting deployment on mobile devices. This work demonstrates that non-uniform quantization in one or more dimensions can significantly ease this memory bottleneck. We provide analysis and experimental results showing that the model size versus accuracy trade-off of neural network quantization improves markedly as the quantization dimensionality increases. To exploit this, we propose GPTVQ: an efficient method that extends GPTQ to non-uniform and vector quantization (VQ). GPTVQ establishes state-of-the-art results in model size versus accuracy across a wide range of LLMs, including Llama-v2/v3 and Mistral. Furthermore, our method is fast: on a single H100 it takes between 3 and 11 hours to process Llama-v2 70B. Finally, we show that VQ is practical by demonstrating simultaneous reductions in DRAM footprint and latency for a VQ-quantized LLM on a mobile-class Arm CPU and a desktop Nvidia GPU.
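To make the underlying idea concrete, the sketch below is a toy illustration of vector quantization of a weight matrix, not the GPTVQ algorithm itself: weights are grouped into d-dimensional vectors, a small codebook is fit with plain k-means, and only per-vector indices plus the codebook are stored. The helper names (`vq_compress`, `vq_decompress`) and all parameter choices are hypothetical and for illustration only.

```python
import numpy as np

def vq_compress(W, dim=2, codebook_bits=8, iters=10, seed=0):
    """Toy vector quantization (NOT GPTVQ): split the weight matrix into
    `dim`-dimensional vectors, fit a k-means codebook, and keep only the
    per-vector codebook indices plus the codebook itself."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, dim)                       # group weights into d-dim points
    k = 2 ** codebook_bits                          # codebook size
    codebook = vecs[rng.choice(len(vecs), k, replace=False)].copy()
    for _ in range(iters):                          # plain Lloyd iterations
        d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    idx_dtype = np.uint8 if k <= 256 else np.uint16
    return assign.astype(idx_dtype), codebook

def vq_decompress(assign, codebook, shape):
    """Reconstruct the (approximate) weight matrix from indices + codebook."""
    return codebook[assign].reshape(shape)

W = np.random.randn(256, 256).astype(np.float32)
idx, cb = vq_compress(W, dim=2, codebook_bits=8)
W_hat = vq_decompress(idx, cb, W.shape)
# Storage: 8 bits per 2-D vector = 4 bits/weight plus a small codebook,
# versus 16/32 bits per weight for the uncompressed matrix.
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```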