Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
Abstract
Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and reinforcement learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related non-determinism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (FSDP) (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize inference throughput, creating a natural mismatch between the two. This precision mismatch can lead to suboptimal performance or even training collapse in RL. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of custom matrix multiplication and reduction kernels that guarantee bit-wise identical results regardless of TP size. Our key insight is to enforce a consistent reduction order across and within GPUs. We implement TBIK in Triton and integrate it into vLLM and FSDP, achieving bit-wise deterministic inference across different TP sizes and zero probability divergence between vLLM and FSDP in RL training pipelines. This eliminates the numerical mismatch caused by different parallel strategies, enabling true on-policy RL at a large scale for the first time.
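The two phenomena the abstract rests on can be illustrated with a small NumPy sketch (our illustration, not the paper's Triton kernels): sequential float32 summation depends on element order because floating-point addition is not associative, whereas a pairwise binary-tree reduction whose shard boundaries coincide with subtree boundaries produces bit-wise identical results whether the data is reduced on one simulated device or split across two. The helper `tree_reduce` and the power-of-two sizes are assumptions for the sketch.

```python
import numpy as np

def tree_reduce(x):
    """Pairwise (binary-tree) reduction with a fixed order.
    Assumes len(x) is a power of two."""
    x = x.copy()
    while len(x) > 1:
        x = x[0::2] + x[1::2]  # combine adjacent pairs each round
    return x[0]

rng = np.random.default_rng(0)
v = rng.standard_normal(1024).astype(np.float32)

# Left-to-right sequential sums depend on element order:
seq_fwd = np.float32(0)
for t in v:
    seq_fwd += t
seq_rev = np.float32(0)
for t in v[::-1]:
    seq_rev += t
# seq_fwd and seq_rev typically differ in their low bits,
# because float addition is not associative.

# Fixed-tree reduction: "TP = 1" reduces everything at once...
full = tree_reduce(v)
# ...while "TP = 2" reduces two shards, then combines the partials.
# The shard boundary coincides with a subtree boundary, so the
# overall reduction tree -- and hence the result -- is identical.
sharded = np.float32(tree_reduce(v[:512]) + tree_reduce(v[512:]))
assert full == sharded  # bit-wise identical across "TP sizes"
```

This mirrors the paper's key insight in miniature: invariance comes not from higher precision but from fixing one reduction tree and making every partitioning reduce along its subtrees.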