xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
Abstract
Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the cached key–value states (KV-Cache). Recent studies attempt to share the KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity, which often proves insufficient in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of the KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression method that jointly factorizes the KV-Cache of grouped layers into a shared low-rank subspace, substantially reducing KV-Cache memory. Across widely used LLMs, xKV achieves up to 8× KV-Cache compression while preserving accuracy on long-context tasks and in multi-turn settings. To further improve efficiency, we introduce Selective Reconstruction (SR) at decode time. Combined with SR, xKV achieves up to a 4.23× end-to-end speedup, surpassing strong baselines with 30% higher throughput at a comparable accuracy level. Overall, xKV provides a plug-and-play approach to reducing both memory and latency for long-context LLM inference. Our code will be open-sourced.
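To make the core idea concrete, the following is a minimal PyTorch sketch of jointly factorizing the KV-Cache of a group of layers into a shared low-rank subspace via a truncated SVD. It is an illustration only, not the released implementation: the function names (`xkv_compress`, `reconstruct`), the tensor layout, and the choice to concatenate layers along the feature dimension are assumptions for exposition; the paper specifies the actual grouping and factorization details.

```python
import torch

def xkv_compress(kv_group, rank):
    """Illustrative sketch (not the authors' code): jointly factorize the
    KV-Cache of a layer group into a shared low-rank subspace.

    kv_group: list of [num_tokens, layer_dim] tensors, one per layer
              in the group (keys or values).
    rank:     target rank of the shared subspace.
    """
    # Concatenate the per-layer caches along the feature dimension so a
    # single SVD yields one set of token-side singular vectors shared
    # by all layers in the group (an assumed layout for illustration).
    stacked = torch.cat(kv_group, dim=-1)               # [T, L * layer_dim]
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank]   # truncate to rank r

    # Store the shared token-side factor once, plus one small
    # feature-side projection for the whole group.
    shared_basis = U_r * S_r                             # [T, r]
    projection = Vh_r                                     # [r, L * layer_dim]
    return shared_basis, projection

def reconstruct(shared_basis, projection, layer_idx, layer_dim):
    """Approximately recover one layer's cache from the shared factors."""
    start = layer_idx * layer_dim
    return shared_basis @ projection[:, start:start + layer_dim]  # [T, layer_dim]
```

Under this sketch, the memory saving comes from storing one `[T, r]` factor per group instead of a full `[T, layer_dim]` cache per layer; a Selective Reconstruction step at decode time would then rebuild only the layer slices actually needed.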