Gram2Token: Enabling Run-time GPU-Native Grammar-Constrained Decoding for LLMs
Hantao Hua ⋅ Jiming Su ⋅ Hao Tang ⋅ Yiping Yao ⋅ Feng Zhu
Abstract
Grammar-constrained decoding is essential for enabling large language models (LLMs) to efficiently generate structured outputs, such as JSON objects for parameter passing. Existing approaches typically execute grammar-constraint masking on the CPU, while LLM inference is performed on the GPU. This execution mismatch introduces frequent grammar-induced CPU $\rightarrow$ GPU control and data synchronization, leading to substantial overhead in large-batch inference. In contrast, we propose Gram2Token, which preprocesses grammar constraints into token-level representations that can be executed natively on GPUs at run time, thereby reducing decoding overhead. Specifically, Gram2Token first converts the input grammar into a pushdown automaton and aligns the automaton with the tokenizer vocabulary via a trie. Through this alignment, pushdown stack configurations are encoded into a finite set of augmented grammar states, and tokens are categorized according to the grammar states in which they are valid. We further design a GPU-native grammar-constrained decoding pipeline that replaces complex run-time grammar parsing with $O(1)$ table lookups and eliminates run-time grammar-induced CPU $\rightarrow$ GPU control dependencies. Experimental results on large-batch JSON and SQL generation tasks show that, compared to state-of-the-art implementations, **Gram2Token improves decoding throughput by 1.5×–2.3×**. These results demonstrate that GPU-native grammar-constrained decoding is an effective and scalable approach for structured LLM generation.
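To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how precompiled grammar tables could drive GPU-native constrained decoding: a (state × vocab) validity mask and a (state × vocab) transition table, assumed to be produced offline from the pushdown automaton and the tokenizer trie, are applied at decode time with $O(1)$ indexed lookups on the GPU. All names, shapes, and table contents are illustrative assumptions.

```python
# Hypothetical sketch of table-driven, GPU-native grammar masking.
# The tables below stand in for the output of an offline preprocessing step
# (PDA construction + tokenizer-trie alignment); their contents are placeholders.
import torch

NUM_STATES = 128    # assumed number of augmented grammar states
VOCAB_SIZE = 32000  # assumed tokenizer vocabulary size
device = "cuda" if torch.cuda.is_available() else "cpu"

# valid_mask[s, t] == True if token t is allowed in augmented grammar state s.
valid_mask = torch.zeros(NUM_STATES, VOCAB_SIZE, dtype=torch.bool, device=device)
# next_state[s, t] == state reached after emitting token t from state s.
next_state = torch.zeros(NUM_STATES, VOCAB_SIZE, dtype=torch.long, device=device)

def constrained_step(logits: torch.Tensor, states: torch.Tensor):
    """One decode step for a batch, entirely on the GPU.

    logits: (batch, vocab) raw model logits
    states: (batch,) current augmented grammar state per sequence
    """
    mask = valid_mask[states]                        # O(1) row gather per sequence
    masked = logits.masked_fill(~mask, float("-inf"))  # forbid grammar-invalid tokens
    tokens = masked.argmax(dim=-1)                   # greedy pick among valid tokens
    states = next_state[states, tokens]              # advance grammar state per sequence
    return tokens, states
```

In this sketch no CPU-side parsing or synchronization occurs inside the decoding loop; the mask and state update are plain tensor indexing, which is the property the GPU-native pipeline relies on.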