Abstract:
This paper introduces RaLMSpec, a framework that accelerates iterative retrieval-augmented language model (RaLM) serving with *speculative retrieval* and *batched verification*. RaLMSpec further introduces several important systems optimizations, including prefetching, an optimal speculation stride scheduler, and asynchronous verification. The combination of these techniques allows RaLMSpec to significantly outperform existing systems. For document-level iterative RaLM serving, evaluation over three LLMs on four QA datasets shows that RaLMSpec improves over existing approaches by 1.75-2.39×, 1.04-1.39×, and 1.31-1.77× when the retriever is an exact dense retriever, an approximate dense retriever, and a sparse retriever, respectively. For token-level iterative RaLM (KNN-LM) serving, RaLMSpec is up to 7.59× and 2.45× faster than existing methods for exact dense and approximate dense retrievers, respectively.
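To make the core mechanism concrete, the following is a minimal Python sketch of speculative retrieval with batched verification. The `generate_step`, `retrieve`, and `cache_lookup` callables are hypothetical stand-ins rather than RaLMSpec's actual API, and the sketch omits prefetching, stride scheduling, and asynchronous verification; it only illustrates how a batch of cheap cache-based speculations is checked with a single batched retriever call, with rollback on the first mismatch.

```python
from typing import Callable, List

def run_iterative_ralm(
    prompt: str,
    generate_step: Callable[[str, str], str],    # (context, doc) -> new context
    retrieve: Callable[[List[str]], List[str]],  # batched exact retriever
    cache_lookup: Callable[[str], str],          # cheap speculative retriever (local cache)
    stride: int = 4,
    max_steps: int = 16,
) -> str:
    """Speculate `stride` retrievals from a local cache, then verify them in one
    batched call to the exact retriever; roll back at the first mismatch.
    Illustrative sketch only, not the paper's implementation."""
    context = prompt
    step = 0
    while step < max_steps:
        # Speculation phase: use cached documents and keep generating.
        queries: List[str] = []
        spec_docs: List[str] = []
        snapshots: List[str] = []
        for _ in range(min(stride, max_steps - step)):
            snapshots.append(context)          # save state for possible rollback
            query = context                    # query derived from current context
            doc = cache_lookup(query)          # speculative (possibly stale) document
            queries.append(query)
            spec_docs.append(doc)
            context = generate_step(context, doc)
        # Verification phase: one batched retriever call replaces
        # len(queries) sequential calls.
        true_docs = retrieve(queries)
        ok = 0
        for spec, true in zip(spec_docs, true_docs):
            if spec != true:
                break
            ok += 1
        if ok < len(queries):
            # Roll back to the first mismatched step and redo it with the
            # verified document; later speculative steps are discarded.
            context = generate_step(snapshots[ok], true_docs[ok])
            step += ok + 1
        else:
            step += len(queries)
    return context
```

The saving comes from the verification phase issuing one batched retriever call per stride instead of one retriever call per generation step, while rollback preserves the exact output of the unaccelerated iterative RaLM.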