Poster
in
Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
Siyan Zhao · Daniel Israel · Guy Van den Broeck · Aditya Grover
During inference for transformer-based LLMs, prefilling computes the key-value (KV) cache for prompt input tokens before autoregressive generation. This work highlights a pitfall of prefilling: for batches containing prompts of highly varying lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs support longer context lengths, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. Prepacking combines prompts of varying lengths into a single sequence using a bin-packing algorithm, packs multiple such sequences into a compact batch, and then modifies the attention mask and positional encoding to compute multiple prefilled KV caches within a single sequence. On standard datasets with varying prompt lengths, our method significantly improves speed and memory efficiency compared to default padding-based prefilling in Huggingface across various model configurations and inference scenarios.
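The sketch below illustrates the core idea under stated assumptions; it is not the authors' implementation. It uses first-fit-decreasing bin packing (one plausible choice of bin-packing heuristic) to assign variable-length prompts to packed rows, and builds a per-row block-diagonal causal attention mask and restarted position ids so each prompt's prefill remains independent inside the packed sequence. All function names (`pack_prompts`, `build_packed_inputs`) and the padding conventions are hypothetical.

```python
# A minimal sketch (not the authors' code) of the prepacking idea:
# prompts of varying lengths are packed into shared rows via first-fit-
# decreasing bin packing; a block-diagonal causal mask and per-prompt
# restarted position ids keep each prompt's prefill independent.
import torch


def pack_prompts(prompt_lengths, max_len):
    """First-fit-decreasing bin packing: returns bins of prompt indices
    whose total length fits within max_len."""
    order = sorted(range(len(prompt_lengths)), key=lambda i: -prompt_lengths[i])
    bins, remaining = [], []
    for i in order:
        length = prompt_lengths[i]
        for b, cap in enumerate(remaining):
            if length <= cap:
                bins[b].append(i)
                remaining[b] -= length
                break
        else:
            bins.append([i])
            remaining.append(max_len - length)
    return bins


def build_packed_inputs(prompts, pad_id, max_len):
    """Concatenate the prompts assigned to each bin into one row and build
    the block-diagonal causal attention mask and restarted position ids."""
    lengths = [len(p) for p in prompts]
    bins = pack_prompts(lengths, max_len)
    rows, masks, pos_ids = [], [], []
    for bin_indices in bins:
        ids, pos, seg = [], [], []
        for seg_id, i in enumerate(bin_indices):
            ids.extend(prompts[i])
            pos.extend(range(len(prompts[i])))      # positions restart per prompt
            seg.extend([seg_id] * len(prompts[i]))  # segment id identifies each prompt
        pad = max_len - len(ids)
        ids += [pad_id] * pad
        pos += [0] * pad
        seg += [-1] * pad                            # padding is masked out entirely
        seg_t = torch.tensor(seg)
        causal = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        same_prompt = (seg_t[:, None] == seg_t[None, :]) & (seg_t[:, None] >= 0)
        masks.append(causal & same_prompt)           # block-diagonal causal mask
        rows.append(torch.tensor(ids))
        pos_ids.append(torch.tensor(pos))
    return torch.stack(rows), torch.stack(masks), torch.stack(pos_ids), bins


# Example: three prompts of lengths 5, 3, and 2 pack into one row of 10
# tokens instead of a 3 x 5 padded batch.
prompts = [[11, 12, 13, 14, 15], [21, 22, 23], [31, 32]]
input_ids, attn_mask, position_ids, bins = build_packed_inputs(prompts, pad_id=0, max_len=10)
print(input_ids.shape, attn_mask.shape, bins)
# torch.Size([1, 10]) torch.Size([1, 10, 10]) [[0, 1, 2]]
```

In this toy example the three prompts occupy a single packed row with no padding, whereas max-length padding would allocate a 3 x 5 batch; after the prefill forward pass, the per-prompt KV caches can be separated back out (by segment) before autoregressive generation.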