Poster in Workshop: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)
Accelerating Best-of-N via Speculative Rejection
Ruiqi Zhang · Momin Haider · Ming Yin · Jiahao Qiu · Mengdi Wang · Peter Bartlett · Andrea Zanette
The safe and effective deployment of Large Language Models (LLMs) often involves generating helpful and benign responses, producing easily comprehensible code, and crafting content with specific stylistic preferences. While different, these tasks share the common mathematical goal of generating responses from a language model that score highly according to a metric of interest. A popular, well-known decoding strategy for this purpose is the Best-of-N method. The method generates a pre-specified number of responses (N) for a prompt and then returns the highest-scoring response among them. While Best-of-N is both simple and effective, its need to generate and score multiple responses for every prompt incurs high inference costs. In this paper, we take a first step toward accelerating the Best-of-N algorithm by halting the generation of unpromising utterances, namely those that are unlikely to be returned by the algorithm upon completion. Focusing on the alignment problem, we show that this simple strategy yields substantial speedups for the Best-of-N algorithm with minimal performance degradation.
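To make the cost argument concrete, the following is a minimal toy sketch, not the authors' implementation. It compares plain Best-of-N against a variant that stops generating low-ranked responses early. The abstract does not state how "unpromising" responses are identified, so the sketch assumes one simple rule: score every partial response at a fixed checkpoint and continue only the top fraction. The names `best_of_n`, `best_of_n_early_halt`, `checkpoint`, and `keep_fraction` are hypothetical, responses are stand-in lists of integers, and the score is their sum.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Sequence[int]], float]


def best_of_n(streams: Sequence[Sequence[int]], score: Score) -> Tuple[List[int], int]:
    """Plain Best-of-N: fully generate every response, return the best one.

    `streams` stands in for N sampled responses; the second return value is
    the number of tokens generated, used here as a proxy for inference cost.
    """
    finished = [list(s) for s in streams]
    cost = sum(len(r) for r in finished)
    return max(finished, key=score), cost


def best_of_n_early_halt(
    streams: Sequence[Sequence[int]],
    score: Score,
    checkpoint: int,
    keep_fraction: float,
) -> Tuple[List[int], int]:
    """Best-of-N with early halting (assumed rule, see lead-in).

    Generate `checkpoint` tokens of every response, score the partial
    responses, and finish only the top `keep_fraction` of them.
    """
    # Phase 1: generate a short prefix of every response.
    partials = [list(s[:checkpoint]) for s in streams]
    cost = sum(len(p) for p in partials)

    # Rank partial responses by score and keep at least one survivor.
    n_keep = max(1, int(len(streams) * keep_fraction))
    ranked = sorted(range(len(streams)), key=lambda i: score(partials[i]), reverse=True)
    survivors = ranked[:n_keep]

    # Phase 2: finish only the surviving responses.
    finished = []
    for i in survivors:
        full = list(streams[i])
        cost += len(full) - len(partials[i])
        finished.append(full)
    return max(finished, key=score), cost


# Toy data: four "responses" of six tokens each; the score is the token sum.
streams = [
    [1, 1, 1, 1, 1, 1],
    [3, 2, 3, 2, 3, 2],
    [0, 0, 1, 0, 0, 0],
    [2, 3, 2, 3, 3, 3],
]

full_best, full_cost = best_of_n(streams, sum)
fast_best, fast_cost = best_of_n_early_halt(streams, sum, checkpoint=2, keep_fraction=0.5)
print(full_best, full_cost)
print(fast_best, fast_cost)
```

On this toy input both variants return the same response, while the early-halting variant generates fewer tokens. That holds only because the partial scores here happen to predict the final ranking. In general, a response halted early could have become the winner, which is the source of the performance degradation the abstract mentions.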