Beyond Logits: Metastable Latent Dynamics for Sample-Efficient Best-of-N Selection in LLMs
Abstract
Best-of-N selection improves reasoning in large language models (LLMs) by allocating additional test-time compute to sample multiple candidate trajectories, but its success fundamentally relies on reliable verification. In practice, however, widely used proxies based on logit confidence or sample agreement can suffer from calibration collapse, where confidence becomes misaligned with correctness. We therefore move beyond output-level signals and analyze the model's latent dynamics during inference. Drawing on cognitive neuroscience, we hypothesize that effective reasoning exhibits \textit{metastability}: a balance between stability and flexibility that manifests as structured ``dwell-and-jump'' dynamics. We introduce Latent Velocity Entropy (LVE), a training-free metric that quantifies these dynamics via the entropy of internal representation updates. Extensive experiments on four reasoning benchmarks (AIME, GPQA, MATH, Brumo) demonstrate that LVE mitigates calibration collapse and consistently outperforms leading logit-based baselines, surpassing the state-of-the-art baseline (UID) by 1.6\% and majority voting by 4.0\% in average accuracy. Remarkably, our method matches the performance of 10-sample majority voting using only 3 samples, a 70\% reduction in inference cost.
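For intuition, the sketch below shows one plausible instantiation of an LVE-style score, assuming access to one hidden-state vector per generated token from a fixed transformer layer. The function name, the histogram-based entropy estimator, the layer choice, and the ranking direction are all illustrative assumptions, not the paper's exact formulation.

\begin{verbatim}
import numpy as np

def latent_velocity_entropy(hidden_states: np.ndarray,
                            n_bins: int = 32) -> float:
    """Sketch of a Latent Velocity Entropy (LVE) style score.

    hidden_states: shape (T, d), one hidden vector per generated
    token from a fixed transformer layer (an assumed setup).
    """
    # "Velocity" of the latent trajectory: magnitude of the
    # representation update at each generation step, shape (T-1,).
    velocities = np.linalg.norm(np.diff(hidden_states, axis=0), axis=1)
    # Entropy of the empirical velocity distribution. Metastable
    # "dwell-and-jump" dynamics would show many small steps (dwells)
    # punctuated by occasional large updates (jumps).
    counts, _ = np.histogram(velocities, bins=n_bins)
    probs = counts / counts.sum()
    probs = probs[probs > 0]          # drop empty bins before log
    return float(-(probs * np.log(probs)).sum())

# Example: score N candidate trajectories for Best-of-N selection.
# Whether higher or lower LVE indicates a better candidate is not
# specified here, so the ranking direction is an assumption.
rng = np.random.default_rng(0)
candidates = [rng.normal(size=(128, 768)) for _ in range(3)]
scores = [latent_velocity_entropy(h) for h in candidates]
\end{verbatim}

Under these assumptions, a Best-of-N selector would extract each candidate's hidden-state trajectory during decoding, compute this score, and select the candidate whose latent dynamics best match the hypothesized metastable profile.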