ICML Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

Poster
in
Workshop: ES-FoMo: Efficient Systems for Foundation Models

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

Seongjun Yang · Gibbeum Lee · Jaewoong Cho · Dimitris Papailiopoulos · Kangwook Lee

[ Abstract ] [ Project Page ]

[ Poster] [ OpenReview]

Abstract: This paper presents “Predictive Pipelined Decoding (PPD),” a novel approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Breaking from conventional strategies, PPD strategically employs additional compute resources to parallelize the initiation of subsequent token decoding during the ongoing verification of the current token decoding. This innovative method reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We have developed a theoretical framework that allows us to analyze the trade-off between computation and latency. Using this framework, we can estimate the potential reduction in latency associated with our proposed method, achieved through the assessment of the match rate, represented as

$p_{correct}$ . Our results demonstrate that the use of extra computational resources has the potential to significantly accelerate LLM greedy decoding.

Chat is not available.

Poster in Workshop: ES-FoMo: Efficient Systems for Foundation Models

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

Seongjun Yang · Gibbeum Lee · Jaewoong Cho · Dimitris Papailiopoulos · Kangwook Lee

Poster
in
Workshop: ES-FoMo: Efficient Systems for Foundation Models