VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
Abstract
Speculative decoding (SD) addresses the high inference cost of large language models (LLMs) by having lightweight drafters generate candidate tokens that a large verifier validates in parallel. Existing draft-verify methods make a binary decision for each draft token: accept it or fully recompute it with the large model. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing. This motivates using such a slim-verifier to handle tokens that require only moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built around a routed slim-verifier. Draft tokens are processed hierarchically: high-confidence tokens are accepted directly, medium-confidence tokens are regenerated by the slim-verifier, and uncertain tokens are verified by the full model. Across summarization, translation, reasoning, QA, and coding tasks on encoder-decoder and decoder-only model families, VIA-SD consistently lowers rejection rates (0.1–0.22) and achieves a 10–20\% speedup over state-of-the-art SD methods. Compared to decoding without drafting, VIA-SD provides 2.5–3× acceleration while improving output quality. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results establish multi-tier SD as a general paradigm for scalable and efficient LLM inference. Our code will be publicly available.
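The hierarchical routing described above can be sketched as a simple confidence-threshold dispatch. This is an illustrative sketch only: the threshold values (`TAU_HIGH`, `TAU_LOW`), the verifier callables, and the function name `route_token` are assumptions for exposition, not VIA-SD's actual implementation or routing criterion.

```python
# Hypothetical three-tier verification routing in the style of VIA-SD.
# Thresholds and interfaces are illustrative assumptions, not the
# paper's actual mechanism.

TAU_HIGH = 0.9  # above this, accept the draft token directly (tier 1)
TAU_LOW = 0.4   # below this, fall back to the full verifier (tier 3)

def route_token(draft_token, confidence, slim_verifier, full_verifier):
    """Route one draft token to the cheapest sufficient verifier."""
    if confidence >= TAU_HIGH:
        return draft_token                  # tier 1: direct acceptance
    if confidence >= TAU_LOW:
        return slim_verifier(draft_token)   # tier 2: slim-verifier regeneration
    return full_verifier(draft_token)       # tier 3: full-model verification
```

The design intuition is that the full verifier is invoked only for the hardest tokens, so the expected per-token verification cost falls as more tokens resolve in the cheaper tiers.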