Characterizing Prompt Compression Methods for Long Context Inference

Siddharth Jha · Lutfi Erdogan · Sehoon Kim · Kurt Keutzer · Amir Gholaminejad

2024 Oral in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models


Abstract

Retrieval-augmented generation has become a popular paradigm for integrating custom data sources with large language models (LLMs). However, it often produces large contexts of tens of thousands of tokens. Long context inference is challenging at the system level, where it increases compute and memory requirements, and from an accuracy perspective, since the model must reason over long contexts. This has led to prompt compression techniques that aim to reduce the size of the provided context while preserving key information. Despite the wide variety of recently proposed methods for compressing long contexts, there has been little standardized analysis of how different methods behave across compression rates and tasks. In this paper, we provide a comprehensive characterization and evaluation of prompt compression methods, giving insight into building compression techniques for long context applications. We analyze extractive compression, summarization-based abstractive compression, and token pruning methods. We find that extractive compression is a strong choice, often compressing contexts by more than 10x with minimal accuracy loss. Token pruning shows marginal improvements over extractive compression on summarization tasks. Furthermore, generating query-aware summaries can significantly improve the performance of abstractive compression, by up to 10 points on multi-document QA tasks at 30x compression.
