Cache-Coherent Resampling for Efficient Test-Time Scaling in LLM Reasoning via Adaptive Sequential Monte Carlo
Ke Wang ⋅ Zehao Yu ⋅ Luwei Wang ⋅ Yongchao Huang
Abstract
Recent work shows that chain-based sampling from power-shaped trajectory distributions can deliver large test-time gains from a fixed base LLM, approaching RL-trained reasoners such as GRPO. Deployment, however, is the bottleneck: autoregressive Metropolis–Hastings is inherently serial, limits GPU utilization, and exhibits extreme tail latency at high budgets, reaching p95 $=1318$ s on MATH500 at $128\times$. We propose Adaptive Sequential Monte Carlo (ASMC), a parallel particle-based inference method that targets power-shaped trajectory distributions while adapting particle populations to problem hardness. To make resampling practical for Transformers, we introduce cache-coherent resampling, which realizes ancestry updates by reordering KV caches and other particle-bound tensors, avoiding prefix recomputation. On MATH500 at the same budget, ASMC attains $80.6\%$ exact-match accuracy with p95 $=73.7$ s, improving the accuracy-to-tail-latency trade-off over both sequential MCMC and best-of-$n$. We further analyze particle degeneracy and find that collapse severity, measured by low $\mathrm{ESS}_{\min}/N$, strongly predicts failures, while sensitivity to the choice of resampling scheme is limited.
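The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the KV-cache layout, the `systematic_resample` helper, and the use of NumPy in place of GPU tensors are all assumptions for illustration. The key idea of cache-coherent resampling is that an ancestry update is a gather along the particle dimension of every particle-bound tensor, so no prefix is ever re-encoded; the effective sample size $\mathrm{ESS} = 1/\sum_i w_i^2$ is the degeneracy diagnostic the abstract normalizes as $\mathrm{ESS}_{\min}/N$.

```python
import numpy as np


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; equals N when uniform."""
    w = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(w ** 2)


def systematic_resample(weights, rng):
    """Systematic resampling: one uniform draw, evenly spaced positions.

    Returns an ancestor index for each of the N particles.
    """
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)


def cache_coherent_resample(kv_cache, weights, rng):
    """Realize the ancestry update by reordering particle-bound tensors.

    kv_cache: hypothetical layout [num_particles, num_layers, seq_len, dim].
    Particle i continues decoding from the KV state of ancestors[i];
    the gather replaces prefix recomputation.
    """
    ancestors = systematic_resample(weights, rng)
    return kv_cache[ancestors], ancestors
```

For example, with all weight on one particle, every ancestor index points at it and the gathered cache contains N copies of that particle's KV state, ready for the next decoding step.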