

Poster

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Minsik Cho · Mohammad Rastegari · Devang Naik

Hall C 4-9
Thu 25 Jul 2:30 a.m. PDT — 4 a.m. PDT

Abstract:

Large Language Model (LLM) inference has two phases: the prompt (or prefill) phase to output the first token, and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation and communication automatically. Second, since it already exists for the extension phase, KV-Runahead is easy to implement. We further propose context-level load-balancing to handle uneven KV-cache generation (due to causal attention) and to optimize TTFT. Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 1.4× and 1.6× speedups for Llama 7B and Falcon 7B, respectively.
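The context-level load-balancing idea can be illustrated with a simple cost model: under causal attention, a token attends to its entire prefix, so chunks later in the prompt are more expensive per token and should be shorter. The sketch below is a minimal illustration under that quadratic cost assumption, not the paper's implementation; the `uneven_partition` helper and its closed-form boundaries are assumptions for exposition only.

```python
import math

def uneven_partition(context_len: int, num_workers: int) -> list[tuple[int, int]]:
    """Split a prompt of `context_len` tokens into `num_workers` contiguous
    chunks whose causal-attention cost is roughly equal.

    Assumption: token i attends to all i preceding tokens, so a chunk covering
    [s, e) costs about (e^2 - s^2) / 2.  Equalizing that cost places the
    boundaries at context_len * sqrt(j / num_workers).
    """
    bounds = [round(context_len * math.sqrt(j / num_workers))
              for j in range(num_workers + 1)]
    return [(bounds[j], bounds[j + 1]) for j in range(num_workers)]

# Example: 4 workers over a 1024-token prompt.  Earlier workers receive longer
# chunks because each of their tokens attends to a shorter prefix.
for rank, (start, end) in enumerate(uneven_partition(1024, 4)):
    print(f"rank {rank}: tokens [{start}, {end})  ({end - start} tokens)")
```

In practice the paper's load balancer also accounts for how each process's partial KV-cache is handed to the next process, so the actual chunk sizes may differ from this closed-form split.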
