Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Model
Meguru Yamazaki · Shivaram Venkataraman
The widespread adoption of Large Language Models (LLMs) such as ChatGPT has highlighted significant challenges in inference cost management due to their autoregressive nature, which requires sequential token generation. The KV cache was introduced to mitigate recomputation costs during inference, but at the expense of increased GPU memory usage, especially as input and output lengths grow. We introduce the Cumulative Observation Oracle (CO2), a novel approach that optimizes KV cache replacement through a sophisticated scoring system. Our method leverages an extended observation period, a decay mechanism for attention scores, and an optimized FIFO cache size adjustment to manage cache space efficiently and reduce overall memory demands. Evaluation on OPT-6.7B and Llama2-7B demonstrates that CO2 significantly reduces memory usage while maintaining output quality, yielding 1.44x and 1.32x faster token generation throughput on OPT-6.7B and Llama2-7B, respectively.
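To make the scoring idea concrete, the following is a minimal Python sketch of a decayed cumulative attention-score tracker for KV cache eviction, assuming an exponential decay on past scores and a fixed observation window during which newly cached tokens are protected from eviction. All names and parameters here (`CO2Sketch`, `budget`, `decay`, `observe_steps`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class CO2Sketch:
    """Sketch of decayed cumulative attention scoring for KV cache eviction.

    Assumptions (not from the paper): exponential decay on accumulated
    scores, and a per-token observation window that defers eviction
    decisions until enough attention evidence has been collected.
    """

    def __init__(self, budget: int, decay: float = 0.99, observe_steps: int = 4):
        self.budget = budget                  # max KV entries to keep
        self.decay = decay                    # decay applied to past scores
        self.observe_steps = observe_steps    # grace period for new entries
        self.scores = np.zeros(0)             # cumulative score per cached token
        self.age = np.zeros(0, dtype=int)     # steps since each token was cached

    def step(self, attn_row: np.ndarray) -> np.ndarray:
        """Update scores with one decoding step's attention row, then evict.

        attn_row: attention weights from the newly generated token to all
        currently cached tokens plus itself (length = len(self.scores) + 1).
        Returns the indices of cached tokens that are kept.
        """
        # Append a slot for the new token, decay old scores, accumulate new ones.
        self.scores = np.append(self.scores * self.decay, 0.0)
        self.age = np.append(self.age + 1, 0)
        self.scores += attn_row

        keep = np.arange(len(self.scores))
        if len(self.scores) > self.budget:
            # Tokens still inside the observation window are protected, so the
            # eviction decision rests on a longer accumulation of evidence.
            candidates = np.where(self.age >= self.observe_steps)[0]
            n_evict = len(self.scores) - self.budget
            evict = candidates[np.argsort(self.scores[candidates])[:n_evict]]
            keep = np.setdiff1d(keep, evict)
            self.scores = self.scores[keep]
            self.age = self.age[keep]
        return keep
```

In this sketch, lengthening `observe_steps` trades a temporarily larger cache for better-informed evictions, which is the intuition behind the extended observation period described in the abstract.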