GPTVQ: The Blessing of Dimensionality for LLM Quantization
Abstract
Large language models (LLMs) impose large DRAM footprint and memory bandwidth costs, severely limiting deployment on mobile devices. This work demonstrates that non-uniform quantization in one or more dimensions can significantly ease this memory bottleneck. We provide analysis and experimental results showing that the model size versus accuracy trade-off of neural network quantization improves markedly as the quantization dimensionality increases. To exploit this, we propose GPTVQ: an efficient method that extends GPTQ to non-uniform and vector quantization (VQ). GPTVQ establishes state-of-the-art results in model size versus accuracy across a wide range of LLMs, including Llama-v2/v3 and Mistral. Furthermore, our method is fast: on a single H100 it takes between 3 and 11 hours to process Llama-v2 70B. Finally, we show that VQ is practical by demonstrating simultaneous reductions in DRAM footprint and latency for a VQ-quantized LLM on a mobile-class Arm CPU and a desktop Nvidia GPU.
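To make the underlying idea concrete, the sketch below is a toy illustration of vector quantization of a weight matrix, not the GPTVQ algorithm itself: weights are grouped into d-dimensional vectors, a small codebook is fit with plain k-means, and only per-vector indices plus the codebook are stored. The helper names (`vq_compress`, `vq_decompress`) and all parameter choices are hypothetical and for illustration only.

```python
import numpy as np

def vq_compress(W, dim=2, codebook_bits=8, iters=10, seed=0):
    """Toy vector quantization (NOT GPTVQ): split the weight matrix into
    `dim`-dimensional vectors, fit a k-means codebook, and keep only the
    per-vector codebook indices plus the codebook itself."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, dim)                       # group weights into d-dim points
    k = 2 ** codebook_bits                          # codebook size
    codebook = vecs[rng.choice(len(vecs), k, replace=False)].copy()
    for _ in range(iters):                          # plain Lloyd iterations
        d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    idx_dtype = np.uint8 if k <= 256 else np.uint16
    return assign.astype(idx_dtype), codebook

def vq_decompress(assign, codebook, shape):
    """Reconstruct the (approximate) weight matrix from indices + codebook."""
    return codebook[assign].reshape(shape)

W = np.random.randn(256, 256).astype(np.float32)
idx, cb = vq_compress(W, dim=2, codebook_bits=8)
W_hat = vq_decompress(idx, cb, W.shape)
# Storage: 8 bits per 2-D vector = 4 bits/weight plus a small codebook,
# versus 16/32 bits per weight for the uncompressed matrix.
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```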