Drift Is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning
Abstract
Despite rapid progress in Vision-Language-Action (VLA) models for robotic control, instruction drift remains a persistent failure mode in long-horizon tasks. This paper reconceptualizes the phenomenon, positing that instruction drift is fundamentally a systematic sampling error: locally greedy sampling is prone to collapsing into ``Negative Pivotal Windows,'' irreversible local optima that carry high local probability yet sever global success pathways. To address this, we propose \textbf{Context-Aware Power Sampling (CAPS)}, a training-free inference-time computation framework. CAPS leverages power distributions to sharpen global trajectory probabilities, effectively activating the model's implicit world model for lookahead planning. Furthermore, we introduce a metacognitive control mechanism based on the Signal-to-Noise Ratio (SNR), which triggers an adaptive MCMC search only when drift risk is detected, enabling a dynamic transition from ``intuitive fast thinking'' to ``rational slow search.'' Experiments on the RoboTwin, Simpler-WidowX, and LIBERO-Long benchmarks demonstrate that CAPS significantly outperforms state-of-the-art baselines such as OpenVLA and TACO without any parameter updates. These results confirm that adaptive inference-time computation is a potent pathway to improving the long-horizon robustness of embodied agents.
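For intuition, the following is a minimal sketch of the two mechanisms the abstract names, under assumptions not stated there: the sharpening exponent $\alpha$, the per-step confidence statistics $\mu_t$ and $\sigma_t$ (e.g., the mean and spread of the policy's action scores), and the threshold $\eta$ are illustrative placeholders rather than the paper's own notation. A power distribution over trajectories $\tau$ given goal $g$ can be written as
\[
p_\alpha(\tau \mid g) \;=\; \frac{p_\theta(\tau \mid g)^{\alpha}}{\sum_{\tau'} p_\theta(\tau' \mid g)^{\alpha}}, \qquad \alpha \ge 1,
\]
so that raising trajectory probabilities to the power $\alpha$ concentrates mass on globally likely trajectories rather than on locally greedy continuations. The SNR-based gate can then be read as a per-step switch between decoding modes:
\[
\mathrm{SNR}(s_t) \;=\; \frac{\mu_t}{\sigma_t}, \qquad
a_t \;=\;
\begin{cases}
\arg\max_a \, p_\theta(a \mid s_t, g), & \mathrm{SNR}(s_t) \ge \eta \;\;\text{(fast, greedy)},\\[4pt]
a \sim \mathrm{MCMC}\!\left(p_\alpha(\cdot \mid s_t, g)\right), & \mathrm{SNR}(s_t) < \eta \;\;\text{(slow, search)}.
\end{cases}
\]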