EntroKV: Entropy-Guided Dynamic Budget Allocation for KV-Cache Compression
Wenhao Gao ⋅ Haoran Cao ⋅ Yueyan Li ⋅ YongGao Xiao ⋅ Caixia Yuan ⋅ Xiaojie Wang
Abstract
The prohibitive memory footprint of the Key-Value (KV) cache poses a critical bottleneck for efficient long-context LLM serving. Current compression techniques typically rely on static or uniform budget allocation, overlooking the significant heterogeneity in information density across attention heads. To address this, we introduce \textsc{EntroKV}, an entropy-driven dynamic budget allocation framework. Our method adaptively allocates compression budgets across layers, attention heads, and tasks. We demonstrate that attention entropy serves as a robust proxy for compression sensitivity: heads with high entropy require larger retention budgets, whereas low-entropy heads can be aggressively compressed without accuracy degradation. Functioning as a lightweight, plug-and-play module, \textsc{EntroKV} optimizes budget scheduling in real time and is compatible with diverse compression operators. Extensive experiments demonstrate that \textsc{EntroKV} consistently outperforms baselines, retaining $\sim$98\% of full-cache performance at a 30\% budget ratio with negligible computational overhead. Our code is available at \url{https://anonymous.4open.science/r/EntroKV-D0C8/}.
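The core idea of entropy-guided allocation can be illustrated with a minimal sketch: compute the Shannon entropy of each head's attention distribution and split a total KV-token budget proportionally, so diffuse (high-entropy) heads retain more cache entries than peaked (low-entropy) ones. The function names and the proportional-allocation rule below are illustrative assumptions, not the paper's exact scheduling algorithm.

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy of each head's attention distribution.

    attn: array of shape (num_heads, seq_len, seq_len); each row sums to 1.
    Returns an array of shape (num_heads,).
    """
    eps = 1e-12  # numerical guard for log(0)
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (num_heads, seq_len)
    return ent.mean(axis=-1)

def allocate_budgets(attn, total_budget, min_tokens=1):
    """Split a total KV-token budget across heads in proportion to entropy:
    high-entropy heads (diffuse attention) keep more cache entries,
    low-entropy heads (sharply focused attention) are compressed harder."""
    ent = attention_entropy(attn)
    shares = ent / ent.sum()
    budgets = np.floor(shares * total_budget).astype(int)
    return np.maximum(min_tokens, budgets)  # every head keeps at least min_tokens
```

A head attending uniformly over the context receives a larger share of the budget than a head that concentrates nearly all its mass on a single token, which matches the paper's observation that high-entropy heads are more sensitive to compression.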