Fast KV Compaction via Attention Matching
Adam Zweiger ⋅ Xinghong Fu ⋅ Han Guo ⋅ Yoon Kim
Abstract
Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through *compaction* in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges (Eyuboglu et al., 2025) has shown that it is possible to *train* highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for *fast* context compaction in latent space through **Attention Matching**, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to $50\times$ compaction in seconds on some datasets with little quality loss.
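As a rough illustration of the objective the abstract suggests (the notation here is ours; the paper's exact formulation may differ): for a single KV head with full-context keys and values $K, V \in \mathbb{R}^{n \times d}$, Attention Matching would seek a compact cache $\tilde{K}, \tilde{V} \in \mathbb{R}^{m \times d}$ with $m \ll n$ whose attention outputs match those of the full cache over queries $q$:

$$
\min_{\tilde{K},\, \tilde{V}} \;\; \mathbb{E}_{q}\!\left[\, \left\| \operatorname{softmax}\!\left(\frac{q \tilde{K}^{\top}}{\sqrt{d}}\right) \tilde{V} \;-\; \operatorname{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d}}\right) V \right\|_{2}^{2} \,\right],
$$

possibly alongside a constraint that the compact cache preserves attention mass, e.g. matching the log-normalizer $\log \sum_{j} \exp\!\big(q k_{j}^{\top}/\sqrt{d}\big)$, so that the compacted tokens retain their relative attention weight against any other tokens in context. The per-query expectation and the mass-preservation term are assumptions for illustration, not the paper's stated loss.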