HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction
Abstract
Scaling test-time compute with multi-path chain-of-thought can improve reasoning accuracy, but its gains hinge on an effective exploration–exploitation trade-off. Existing methods handle this trade-off in rigid ways: tree-structured search hard-codes exploration via brittle expansion rules that disrupt post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on a weak answer-selection strategy. Driven by the insight that the optimal balance is phase-dependent and that correct vs. incorrect paths often diverge only at late stages, we reconceptualize test-time scaling as a dynamic expand–reduce control problem over a pool of hypothesis paths. We introduce HyPER, a training-free online control policy for MoE multi-path decoding that reallocates compute under a fixed budget using lightweight path statistics. HyPER features (i) an online controller that shifts from exploration to exploitation as the hypothesis pool evolves, (ii) an MoE-based token-level refinement primitive for efficient generation-time exploitation without full-path resampling, and (iii) a length- and confidence-aware aggregation rule that bridges the existence–selection gap for reliable answer-time exploitation. Extensive experiments across four MoE models and diverse benchmarks demonstrate that HyPER consistently attains the accuracy–compute Pareto frontier, outperforming prior state-of-the-art methods by 8–10% in accuracy while reducing token consumption by 25–40%.
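To make component (iii) concrete, the following is a minimal sketch of a length- and confidence-aware aggregation rule over a pool of hypothesis paths. The exact weighting used by HyPER is not specified in the abstract; the function below, including its name and the particular confidence and length terms, is an illustrative assumption: each path's vote is up-weighted by its mean per-token confidence and mildly discounted by its length, rather than using plain majority voting.

```python
import math
from collections import defaultdict

def aggregate_answers(paths):
    """Illustrative length- and confidence-aware weighted voting.

    `paths` is a list of (answer, mean_logprob, length) tuples, where
    `mean_logprob` is the mean token log-probability of the path. This
    weighting scheme is a hypothetical stand-in for HyPER's actual rule:
    confident paths vote more, and very long paths are mildly discounted.
    """
    scores = defaultdict(float)
    for answer, mean_logprob, length in paths:
        confidence = math.exp(mean_logprob)            # mean per-token prob, in (0, 1]
        length_penalty = 1.0 / math.log(2.0 + length)  # mild discount for longer paths
        scores[answer] += confidence * length_penalty
    # Return the answer with the highest accumulated weighted vote.
    return max(scores, key=scores.get)

# Example: two moderately confident short paths agreeing on "42"
# outvote one confident but very long path answering "7".
pool = [("42", -0.1, 100), ("42", -0.5, 300), ("7", -0.05, 900)]
best = aggregate_answers(pool)
```

Compared with plain majority voting, a rule of this shape lets a few high-confidence paths override many low-confidence duplicates, which is one way to close the gap between a correct answer *existing* in the pool and it actually being *selected*.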