Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

Block Verification Accelerates Speculative Decoding

Ziteng Sun · Uri Mendlovic · Yaniv Leviathan · Asaf Aharoni · Ahmad Beirami · Jae Ro · Ananda Suresh


Abstract

Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a smaller model to draft a block of tokens, which are then verified in parallel by the large model, and it guarantees that the output is distributed identically to a sample from the large model. In prior work, draft verification is performed token-by-token independently. Surprisingly, we show that this approach is not optimal. We propose block verification, a simple, easy-to-implement improved draft verification algorithm that provides additional wall-clock speedup by verifying the entire block jointly. We prove that the proposed mechanism is optimal in the expected number of tokens produced in each iteration and, specifically, is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups of 5%-8% over the standard token verification algorithm across a wide range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default by speculative decoding implementations.
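
For context, the token-by-token baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of standard token-level verification as it is commonly described in the speculative decoding literature, not the block verification algorithm proposed in this work. The function name `token_verify`, its parameters, and the list-based representation of the distributions are assumptions made for illustration only.

```python
import random


def token_verify(draft_tokens, q_probs, p_probs, rng):
    """Sketch of standard token-level draft verification (hypothetical helper).

    draft_tokens: token ids drafted by the small model (length gamma).
    q_probs: gamma next-token distributions from the small model.
    p_probs: gamma + 1 next-token distributions from the large model.
    Returns the accepted prefix followed by one corrected or bonus token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        # q[x] > 0 is assumed because x was sampled from q.
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # On rejection, resample from the residual max(p - q, 0);
        # rng.choices normalizes the weights implicitly.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft was accepted: sample one extra token from the large model.
    last = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

Each drafted token is accepted or rejected on its own, and the first rejection discards the rest of the block. When the two models agree exactly, every draft is accepted and the iteration yields one more token than the block length.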
