

Poster

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

Alina Shutova · Vladimir Malinovskii · Vage Egiazarian · Denis Kuznedelev · Denis Mazur · Surkov Nikita · Ivan Ermakov · Dan Alistarh

East Exhibition Hall A-B #E-2603
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality to reach higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g. attention Keys & Values). We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
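A minimal sketch of the core idea, under stated assumptions: fit a linear map that predicts one layer's keys (or values) from an adjacent layer's cache, then quantize only the residual that the map fails to predict. The function names, the plain least-squares fit, and the uniform residual quantizer below are illustrative placeholders, not the AQUA-KV implementation described in the paper.

import torch

def fit_linear_predictor(x_prev, x_next):
    # Least-squares linear map W such that x_next ≈ x_prev @ W.
    # x_prev, x_next: [num_tokens, dim] calibration caches from adjacent layers.
    return torch.linalg.lstsq(x_prev, x_next).solution  # [dim, dim]

def quantize_residual(residual, bits=2):
    # Placeholder uniform quantizer; the paper's quantizer is stronger.
    levels = 2 ** bits - 1
    lo, hi = residual.min(), residual.max()
    scale = (hi - lo).clamp_min(1e-8) / levels
    codes = torch.round((residual - lo) / scale).clamp(0, levels)
    return codes, lo, scale

def dequantize_residual(codes, lo, scale):
    return codes * scale + lo

def compress_next_layer(x_prev, x_next, W, bits=2):
    # Store only the quantized prediction residual for the next layer's cache.
    return quantize_residual(x_next - x_prev @ W, bits=bits)

def reconstruct_next_layer(x_prev, codes, lo, scale, W):
    # Recover the next layer's cache from the prediction plus the residual.
    return x_prev @ W + dequantize_residual(codes, lo, scale)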

Lay Summary:

When LLMs generate text, they need to maintain a memory of previous tokens in the form of attention keys and values: tens of thousands of numbers for each token. For tasks where an LLM deals with long texts, this adds up to tens of gigabytes of GPU memory for every sequence in a batch. To avoid running out of GPU memory, people have been compressing KV vectors by quantizing or pruning them. We propose a better way of compressing these keys and values: instead of quantizing them individually, we exploit the mutual information between different layers to quantize them together. Our approach fits a simple linear predictor to predict adjacent-layer key-values and only stores the part that cannot be predicted. This allows us to compress KV vectors with significantly better accuracy, especially for extreme 2-bit quantization.
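For concreteness, a hypothetical one-shot calibration loop in the same spirit, building on fit_linear_predictor from the sketch above: gather keys (or values) from a small calibration set, then fit one predictor per adjacent layer pair. The helper collect_kv_per_layer and all names here are assumptions for illustration, not the authors' code.

import torch

@torch.no_grad()
def calibrate_predictors(collect_kv_per_layer, calibration_texts):
    # collect_kv_per_layer(text) is an assumed helper returning a list of
    # [num_tokens, dim] key (or value) tensors, one per transformer layer.
    gathered = None
    for text in calibration_texts:
        layer_kv = collect_kv_per_layer(text)
        if gathered is None:
            gathered = [[t] for t in layer_kv]
        else:
            for store, t in zip(gathered, layer_kv):
                store.append(t)
    stacked = [torch.cat(chunks, dim=0) for chunks in gathered]

    # One linear predictor per adjacent layer pair; the first layer's cache
    # has no predecessor and would be quantized directly.
    return [fit_linear_predictor(prev, nxt)
            for prev, nxt in zip(stacked[:-1], stacked[1:])]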
