WAVE: Window-Aware Vocabulary-Efficient Early-Exit for Training-Free LLM Acceleration
Seonggeun Kim ⋅ Gilha Lee ⋅ Hyun Kim
Abstract
Large language models (LLMs) incur substantial inference latency due to autoregressive decoding, in which each token requires a full forward pass through all transformer layers. Early-exit methods that terminate computation at intermediate layers offer a promising remedy, yet existing approaches suffer from fundamental limitations. Confidence-based methods rely on evaluating the full LM head at every layer, introducing considerable overhead that can negate the expected speedup. Schedule-based methods avoid this cost through predetermined exit schedules, but their monotonically decreasing layer allocation collapses to shallow layers, thereby constraining the maximum generation length. Learned exit predictors further require costly task-specific training and are vulnerable to distribution shifts in unseen domains. We propose Window-Aware Vocabulary-Efficient Early-Exit (WAVE), a training-free framework that addresses these challenges through two key innovations. First, exit window scheduling identifies an optimal layer range for early-exit decisions via offline calibration, preventing premature convergence to shallow layers while substantially reducing the number of exit checks. Second, a proxy LM head constructs a lightweight vocabulary subset at the window's starting layer, reducing per-layer exit overhead by 87\% relative to the full LM head. WAVE requires no gradient-based training and enables immediate deployment with only a brief calibration phase. Experiments on Llama-2 7B demonstrate up to a 1.4$\times$ average speedup while preserving output quality, with full compatibility with W4A16 quantization, establishing WAVE as a practical early-exit framework for accelerating LLM inference without retraining.
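The two mechanisms in the abstract — a calibrated exit window and a proxy LM head over a small vocabulary subset — can be illustrated with a minimal sketch. This is not the authors' implementation: the window bounds, proxy size, confidence threshold, and all function names below are illustrative assumptions, and the transformer layers are stubbed with random hidden states.

```python
# Illustrative sketch of window-aware early exit with a proxy LM head.
# All hyperparameters (WINDOW, TOP_K, CONF_THRESHOLD) are assumed values,
# not taken from the WAVE paper.
import math
import random

random.seed(0)

HIDDEN, VOCAB, LAYERS = 8, 50, 12
WINDOW = range(6, 10)      # exit window from offline calibration (assumed)
TOP_K = 6                  # proxy vocabulary size (small fraction of VOCAB)
CONF_THRESHOLD = 0.9       # exit when the top token's probability exceeds this

# A toy full LM head: one weight vector per vocabulary entry.
full_head = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]

def logits(hidden, token_ids):
    """Score only the given token ids against the hidden state."""
    return {t: sum(w * h for w, h in zip(full_head[t], hidden)) for t in token_ids}

def max_softmax(scores):
    """Maximum softmax probability, used as the exit confidence."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return max(exps) / sum(exps)

def decode_one(hidden_states):
    """hidden_states[l] = hidden vector after layer l (stubbed here)."""
    proxy_vocab = None
    for layer, h in enumerate(hidden_states):
        if layer not in WINDOW:
            continue  # no exit checks outside the calibrated window
        if proxy_vocab is None:
            # At the window's first layer, build the proxy LM head by
            # keeping only the top-K candidate tokens under the full head.
            full = logits(h, range(VOCAB))
            proxy_vocab = sorted(full, key=full.get, reverse=True)[:TOP_K]
        scores = logits(h, proxy_vocab)  # cheap per-layer exit check
        if max_softmax(list(scores.values())) >= CONF_THRESHOLD:
            return max(scores, key=scores.get), layer  # early exit
    # No confident exit inside the window: run to the final layer.
    full = logits(hidden_states[-1], range(VOCAB))
    return max(full, key=full.get), LAYERS - 1

hidden_states = [[random.gauss(0, 1) for _ in range(HIDDEN)]
                 for _ in range(LAYERS)]
token, exit_layer = decode_one(hidden_states)
print(f"token={token}, exit_layer={exit_layer}")
```

The point of the sketch is the cost structure: the full head is scored once, at the window's first layer, and every subsequent exit check touches only `TOP_K` rows, which is where the per-layer overhead reduction comes from.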