LEAP: Zone-Aware MCTS for LLM Self-Speculative Decoding
LeiQuan Zheng ⋅ Yuan Liu
Abstract
Self-speculative decoding accelerates LLM inference by pairing a lightweight draft model for generation with the target model for verification. The draft model is constructed from a subset of the target model's layers, so the key challenge lies in choosing the layer configuration. To address this challenge, we propose LEAP, a plug-and-play approach that formulates draft model construction as a sequential decision-making process and optimizes it with Monte Carlo Tree Search (MCTS). To navigate the prohibitive search space of deep LLMs, we leverage two empirical observations: (i) redundancy information derived during prefilling remains informative during decoding, and (ii) layer redundancy exhibits zone-wise characteristics. These observations enable a structured search space built through zone partitioning and layer grouping, which serves as an inductive bias that improves the efficiency of MCTS. Experimental results show that LEAP achieves speedups of $1.7\times\sim2.0\times$ for LLM inference.
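To make the formulation concrete, the sketch below shows how layer selection can be cast as a sequential decision process searched with MCTS: the tree is traversed zone by zone, and each decision either skips or keeps a whole layer group. This is an illustrative toy, not LEAP's actual implementation; the layer count, zone size, per-layer redundancy scores, and the surrogate reward (trading skipped-layer count against redundancy) are all hypothetical stand-ins for statistics LEAP would derive from prefilling and from measured acceptance rates.

```python
import math
import random

NUM_LAYERS = 24
ZONE_SIZE = 6  # hypothetical: 4 zones of 6 layers each
GROUPS = [list(range(z, z + ZONE_SIZE)) for z in range(0, NUM_LAYERS, ZONE_SIZE)]

random.seed(0)
# Toy per-layer redundancy scores (stand-in for prefilling-derived statistics).
REDUNDANCY = [random.random() for _ in range(NUM_LAYERS)]

def reward(skipped):
    """Toy surrogate: speedup grows with skipped layers, quality drops when
    low-redundancy layers are removed. A real reward would be a measured
    acceptance rate or wall-clock speedup."""
    if not skipped:
        return 0.0
    speed = len(skipped) / NUM_LAYERS
    quality = sum(REDUNDANCY[i] for i in skipped) / len(skipped)
    return speed * quality

class Node:
    def __init__(self, skipped, depth):
        self.skipped = skipped  # frozenset of skipped layer indices
        self.depth = depth      # index of the next zone to decide on
        self.children = {}      # action (tuple of layers) -> Node
        self.visits = 0
        self.value = 0.0

def actions(depth):
    # Zone-wise decision: skip the whole group, or keep it.
    return [tuple(GROUPS[depth]), ()]

def ucb(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits) / child.visits)

def rollout(node):
    # Random playout over the remaining zones.
    skipped = set(node.skipped)
    for d in range(node.depth, len(GROUPS)):
        skipped.update(random.choice(actions(d)))
    return reward(skipped)

def mcts(iterations=200):
    root = Node(frozenset(), 0)
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend fully expanded nodes by UCB.
        while node.depth < len(GROUPS) and len(node.children) == len(actions(node.depth)):
            parent = node
            node = max(node.children.values(), key=lambda ch: ucb(ch, parent))
            path.append(node)
        # Expansion: add one untried zone-level action.
        if node.depth < len(GROUPS):
            untried = [a for a in actions(node.depth) if a not in node.children]
            a = random.choice(untried)
            child = Node(node.skipped | frozenset(a), node.depth + 1)
            node.children[a] = child
            path.append(child)
            node = child
        # Simulation and backpropagation.
        r = rollout(node)
        for n in path:
            n.visits += 1
            n.value += r
    return root

def best_config(root):
    # Greedy descent by visit count yields the final skip configuration.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return sorted(node.skipped)
```

Because actions operate on whole zones rather than individual layers, the tree depth equals the number of zones (4 here) instead of the number of layers (24), which is the inductive bias the abstract describes: zone partitioning collapses the per-layer search space so MCTS converges with far fewer simulations.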