

Poster

Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation

Zhihao Zhang · Alan Zhu · Lijie Yang · Yihua Xu · Lanting Li · Phitchaya Phothilimthana · Zhihao Jia


Abstract:

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Compared with fine-tuning a fully parametric model, RaLM offers low-cost adaptation to the latest data and better source attribution. Iterative RaLM in particular delivers better generation quality through more frequent interactions between the retriever and the language model, at the cost of high retrieval overhead. To alleviate this, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-ups for iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec automatically exploits the acceleration potential to the fullest. For document-level iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec achieves speed-up ratios of 1.75-2.39×, 1.04-1.39×, and 1.31-1.77× over the baseline when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever, respectively. For token-level iterative RaLM (KNN-LM) serving, RaLMSpec achieves speed-up ratios of up to 7.59× and 2.45× over the baseline with an exact dense retriever and an approximate dense retriever, respectively.
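
To make the core mechanism concrete, below is a minimal Python sketch of speculative retrieval with batched verification as the abstract describes it: decode several steps against a fast local cache, then verify all speculated documents with one batched call to the true retriever and roll back at the first mismatch. The names (ralmspec_generate, lm_step, retrieve) and the cache-based speculation policy are hypothetical placeholders, not the authors' implementation; prefetching, the speculation stride scheduler, and asynchronous verification from the paper are omitted, and a deterministic decoder is assumed so that outputs are preserved exactly.

def ralmspec_generate(prompt, retrieve, lm_step, stride=4, max_tokens=32):
    """Speculative retrieval with batched verification (simplified sketch).

    retrieve(query) -> document          # the slow, exact retriever
    lm_step(context) -> (token, query)   # one deterministic decoding step
    """
    cache = {}                           # local cache acts as the speculative retriever
    context, n_tokens = prompt, 0
    while n_tokens < max_tokens:
        # Speculation phase: decode `stride` steps against the local cache
        # instead of calling the slow retriever at every step.
        snapshots, queries, guesses = [], [], []
        for _ in range(stride):
            snapshots.append(context)
            token, query = lm_step(context)
            queries.append(query)
            guesses.append(cache.get(query, ""))
            context += token + guesses[-1]
            n_tokens += 1
        # Verification phase: one batched call to the true retriever.
        truths = [retrieve(q) for q in queries]  # batched in practice
        for i, (q, guess, truth) in enumerate(zip(queries, guesses, truths)):
            cache[q] = truth                     # refresh the cache
            if guess != truth:
                # First mismatch: roll back, redo this step with the correct
                # document, and discard the later speculated tokens. Verified
                # steps match the non-speculative run exactly, so model
                # outputs are preserved.
                token, _ = lm_step(snapshots[i])
                context = snapshots[i] + token + truth
                n_tokens -= stride - i - 1
                break
    return context[len(prompt):]

A toy run with stub components (again, purely illustrative):

docs = {"q0": "[d0] ", "q1": "[d1] "}
retrieve = lambda q: docs.get(q, "")
def lm_step(ctx):
    i = len(ctx) % 2                     # trivial stand-in decoder
    return f"t{i} ", f"q{i}"
print(ralmspec_generate("prompt: ", retrieve, lm_step))

When a speculated document matches the verified one, the stride of decoding steps costs only one batched retrieval instead of `stride` sequential ones, which is where the reported speed-ups come from.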
