EAKV: An Entropy-Driven Adaptive KV Compression Framework for Long Video Understanding
Abstract
Although Multimodal Large Language Models have made remarkable progress, they still struggle with long-video understanding due to the massive memory footprint of KV caches. Existing methods often resort to disjoint retrieval or static attention-based reduction to achieve compression. However, these methods disrupt temporal continuity and ignore the varying information density across network layers. In this work, we reveal that memory allocation should mirror layer-wise semantic density rather than adhere to a uniform budget. To this end, we introduce EAKV, a training-free, entropy-driven adaptive KV compression framework that leverages attention entropy to adaptively allocate compression budgets, selectively preserving critical tokens while distilling redundant contexts into compact contextual anchors, thereby achieving granular memory allocation proportional to semantic density. Extensive experiments on four benchmarks demonstrate that EAKV surpasses existing methods across varying model scales, with improvements ranging from 1.5% to 4.8%.
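To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of how attention entropy could drive per-layer budget allocation and anchor distillation as described above; the function names, the mean-pooled anchor construction, and the NumPy setting are illustrative assumptions.

```python
import numpy as np

def allocate_budgets(attn_maps, total_budget):
    """Assign per-layer KV budgets proportional to attention entropy.

    attn_maps: list of (num_queries, num_keys) attention matrices, one per layer.
    total_budget: total number of KV entries to keep across all layers.
    """
    entropies = []
    for attn in attn_maps:
        p = attn / attn.sum(axis=-1, keepdims=True)        # re-normalize each query row
        h = -(p * np.log(p + 1e-9)).sum(axis=-1).mean()    # mean row entropy for this layer
        entropies.append(h)
    entropies = np.array(entropies)
    weights = entropies / entropies.sum()                   # higher entropy -> larger share
    return np.maximum(1, (weights * total_budget).astype(int))

def compress_layer(keys, values, attn, budget):
    """Keep the `budget` most-attended tokens; pool the rest into one contextual anchor."""
    scores = attn.mean(axis=0)                              # per-key importance
    keep = np.argsort(scores)[-budget:]
    drop = np.setdiff1d(np.arange(len(scores)), keep)
    if len(drop) > 0:
        anchor_k = keys[drop].mean(axis=0, keepdims=True)    # distill dropped context
        anchor_v = values[drop].mean(axis=0, keepdims=True)
        return np.vstack([keys[keep], anchor_k]), np.vstack([values[keep], anchor_v])
    return keys[keep], values[keep]
```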