Poster

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Payman Behnam ⋅ Yaosheng Fu ⋅ Ritchie Zhao ⋅ Po-An Tsai ⋅ Zhiding Yu ⋅ Alexey Tumanov

2025 Poster

Abstract

Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, end-to-end speedup of up to 3.7× as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme.
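
The two stages described above can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation. The function names (`stage1_evict`, `stage2_sparse_attention`) and parameters (`window_queries`, `budget`, `reduced_dims`) are hypothetical. Scoring prompt tokens by the attention they receive from a few trailing prompt queries is an assumed eviction rule, and only a head-dimension reduction is shown; the sequence-dimension reduction mentioned in the abstract is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_indices(scores, k):
    """Indices of the k largest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def stage1_evict(keys, values, window_queries, budget):
    """Stage 1 (assumed scoring rule): permanently keep only the `budget`
    prompt tokens that receive the most attention from `window_queries`."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * len(keys)
    for q in window_queries:
        probs = softmax([dot(q, k) * scale for k in keys])
        importance = [s + p for s, p in zip(importance, probs)]
    keep = top_indices(importance, min(budget, len(keys)))
    return [keys[i] for i in keep], [values[i] for i in keep], keep


def stage2_sparse_attention(query, keys, values, top_k, reduced_dims):
    """Stage 2 (simplified): approximate scores using only the `reduced_dims`
    largest-magnitude query dimensions, pick the top_k tokens, then run
    exact softmax attention over just those tokens."""
    dims = top_indices([abs(x) for x in query], min(reduced_dims, len(query)))
    approx = [sum(query[d] * k[d] for d in dims) for k in keys]
    chosen = top_indices(approx, min(top_k, len(keys)))
    scale = 1.0 / math.sqrt(len(query))
    probs = softmax([dot(query, keys[i]) * scale for i in chosen])
    out = [0.0] * len(values[0])
    for p, i in zip(probs, chosen):
        out = [o + p * v for o, v in zip(out, values[i])]
    return out, chosen


# Toy usage: six prompt tokens with 2-dimensional keys and 1-dimensional values.
keys = [[0.0, 0.0], [0.0, 0.0], [10.0, 0.0], [0.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
values = [[float(i)] for i in range(6)]
kept_keys, kept_values, kept_positions = stage1_evict(
    keys, values, window_queries=[[1.0, 0.0], [0.0, 1.0]], budget=2
)
output, selected = stage2_sparse_attention(
    [5.0, 0.1], kept_keys, kept_values, top_k=1, reduced_dims=1
)
```

In this sketch, stage 1 runs once on the prompt and its result is final, while stage 2 runs at every decode step with the current query, so the set of attended tokens can change from step to step within the retained cache.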

Lay Summary

Large language models (LLMs), like ChatGPT, are excellent at generating human-like text but struggle when handling very long documents. The main issue is that they must remember a large amount of past information, and this memory keeps growing, so they slow down as text generation continues. To overcome this challenge, we developed a training-free technique called RocketKV, which compresses this memory in two consecutive stages. In the first stage, it permanently removes information judged to be less important, immediately and significantly reducing memory usage. In the second stage, RocketKV selects the most relevant pieces of the remaining memory based on the current query, deciding dynamically what to prioritize at each text generation step. This two-stage strategy lets the model retain essential information efficiently without being overwhelmed by irrelevant details. By combining permanent pruning with dynamic selection, RocketKV compresses this memory by up to 400 times and speeds up text generation by up to 3.7 times when running on an NVIDIA A100 GPU. This allows LLMs to handle long conversations and extensive documents more effectively, making advanced language technologies more practical and accessible in everyday applications.
