Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
Raghavv Goel ⋅ Mukul Gagrani ⋅ Mingu Lee ⋅ Christopher Lott
Abstract
Large Language Models (LLMs) possess latent multi-token prediction (MTP) capabilities despite being trained only for next-token generation. We introduce a simple and training-free MTP method that probes an LLM using on-the-fly mask tokens derived from its embedding space, enabling parallel future-token prediction without modifying weights or relying on draft models. We construct a speculative token tree by sampling Top-$K$ candidates from mask-token logits and apply a lightweight pruning rule to retain high-probability continuations. During generation, predicted tokens are verified in parallel, yielding lossless decoding while significantly reducing the number of model calls and increasing token throughput. Our probing-based MTP method consistently outperforms existing training-free baselines, improving acceptance length by approximately $12\%$ on LLaMA3 and $8$–$12\%$ on Qwen3, and increasing throughput by up to $15$–$19\%$. We further provide theoretical analysis and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step predictions without retraining or auxiliary models.
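To make the probing idea concrete, here is a minimal sketch of one forward pass that appends mask embeddings after the prompt and reads Top-$K$ candidates for several future positions at once. It is illustrative only: the checkpoint name, `num_mask`, `top_k`, and the choice of the mean vocabulary embedding as the mask vector are all assumptions, not the paper's reference implementation.

```python
# Illustrative sketch of embedding-space mask-token probing for multi-token
# prediction. Model id and the mean-embedding mask are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def probe_future_tokens(prompt: str, num_mask: int = 3, top_k: int = 4):
    """Append `num_mask` on-the-fly mask embeddings after the prompt and
    read Top-K candidate tokens for each future position in one pass."""
    ids = tok(prompt, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids)                # (1, T, d)
    # One simple mask choice: the mean of the vocabulary embedding matrix.
    # (An assumption; the paper derives mask tokens from the embedding
    # space, not necessarily as this mean.)
    mask = model.get_input_embeddings().weight.mean(dim=0)
    masks = mask.expand(1, num_mask, -1).to(emb.dtype)
    out = model(inputs_embeds=torch.cat([emb, masks], dim=1))
    # Logits at the last prompt position predict token t+1 as usual;
    # logits at each mask position probe tokens t+2, t+3, ...
    future_logits = out.logits[0, ids.shape[1] - 1 :, :]
    return future_logits.topk(top_k, dim=-1).indices       # (num_mask+1, K)

print(probe_future_tokens("The capital of France is"))
```

In a full speculative-decoding loop, these Top-$K$ candidates would seed the token tree, a pruning rule would keep high-probability branches, and the base model would verify the tree in parallel so that accepted tokens match standard autoregressive decoding exactly.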