Poster
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
Alina Shutova · Vladimir Malinovskii · Vage Egiazarian · Denis Kuznedelev · Denis Mazur · Surkov Nikita · Ivan Ermakov · Dan Alistarh
East Exhibition Hall A-B #E-2603
When LLMs generate text, they need to maintain a memory of previous tokens in the form of attention keys and values: tens of thousands of numbers for each token. For tasks where an LLM deals with long texts, this adds up to tens of gigabytes of GPU memory for every sequence in a batch. To avoid running out of GPU memory, practitioners compress these KV vectors by quantizing or pruning them. We propose a better way of compressing keys and values: instead of quantizing them individually, we exploit the mutual information between different layers to quantize them together. Our approach fits a simple linear predictor for adjacent-layer keys and values and stores only the part that cannot be predicted. This allows us to compress KV vectors with significantly better accuracy, especially for extreme 2-bit quantization.
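A minimal sketch of the idea, not the authors' implementation: fit a least-squares linear map from one layer's keys/values to the next layer's, then quantize only the residual. The function names, tensor shapes, and the toy uniform 2-bit quantizer below are illustrative assumptions.

```python
import torch

def fit_linear_predictor(kv_prev: torch.Tensor, kv_next: torch.Tensor) -> torch.Tensor:
    """Least-squares fit of W so that kv_prev @ W approximates kv_next.

    kv_prev, kv_next: calibration tensors of shape [num_tokens, dim] from
    adjacent layers (shapes are an assumption for this sketch).
    """
    return torch.linalg.lstsq(kv_prev, kv_next).solution

def quantize_uniform(x: torch.Tensor, bits: int = 2):
    """Toy per-tensor uniform quantizer standing in for the real quantizer."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    q = torch.clamp(torch.round((x - lo) / scale), 0, levels)
    return q.to(torch.uint8), scale, lo

def dequantize_uniform(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor) -> torch.Tensor:
    return q.float() * scale + lo

def compress(kv_prev: torch.Tensor, kv_next: torch.Tensor, W: torch.Tensor, bits: int = 2):
    """Store only the quantized residual: the part the predictor cannot explain."""
    residual = kv_next - kv_prev @ W
    return quantize_uniform(residual, bits)

def decompress(kv_prev: torch.Tensor, W: torch.Tensor, q, scale, lo) -> torch.Tensor:
    """Reconstruct the next layer's KV from the previous layer plus the residual."""
    return kv_prev @ W + dequantize_uniform(q, scale, lo)
```

In this sketch, only the 2-bit residual (plus the small predictor matrix) needs to be kept per layer, which is where the memory savings would come from.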