Mixture of Horizons in Action Chunking
Dong Jing ⋅ Gang Wang ⋅ Jiaqi Liu ⋅ Weiliang Tang ⋅ Zelong Sun ⋅ Yunchao Yao ⋅ Zhenyu Wei ⋅ Yunhui Liu ⋅ Zhiwu Lu ⋅ Mingyu Ding
Abstract
Vision-language-action models exhibit an inherent trade-off in action chunk length (``horizon''): longer horizons improve global foresight but degrade fine-grained local control, while shorter ones yield the opposite. To mitigate this trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. In brief, MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It offers three appealing benefits. i) Long-term foresight and short-term precision are jointly exploited within a single model. ii) MoH is plug-and-play for full-attention action modules, with minimal training or inference overhead. iii) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on flow-based and one-step regression policies demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state-of-the-art with a 99\% average success rate on LIBERO after only $30k$ training iterations.
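As a rough illustration of the fusion scheme described above, the following PyTorch sketch decodes several horizon-limited views of one action chunk with a shared transformer and fuses them with a linear gate. Everything here is an assumption for illustration, not the paper's implementation: the class name `MoHActionHead`, the prefix-truncation segmentation, the horizon set `(4, 8, 16)`, and the transformer configuration are all hypothetical.

```python
# Hypothetical sketch of MoH-style fusion (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoHActionHead(nn.Module):
    """One shared action transformer decodes several horizon-limited
    segments of the chunk; a light linear gate fuses the outputs."""

    def __init__(self, d_model: int = 64, horizons=(4, 8, 16)):
        super().__init__()
        self.horizons = horizons
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # The same transformer is reused for every horizon (shared weights).
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        # Light linear gate: per-token weights over the candidate horizons.
        self.gate = nn.Linear(d_model, len(horizons))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, chunk_len, d_model) tokens of one action chunk.
        B, T, D = x.shape
        per_horizon = []
        # Loop shown for clarity; the segments could be batched and run
        # in parallel through the shared transformer.
        for h in self.horizons:
            y = self.shared(x[:, :h])                       # decode h-step segment
            per_horizon.append(F.pad(y, (0, 0, 0, T - h)))  # pad back to chunk_len
        stacked = torch.stack(per_horizon, dim=-1)   # (B, T, D, n_horizons)
        weights = self.gate(x).softmax(dim=-1)       # (B, T, n_horizons)
        return (stacked * weights.unsqueeze(2)).sum(dim=-1)

chunk = torch.randn(2, 16, 64)          # toy chunk: 16 future action tokens
fused = MoHActionHead()(chunk)          # (2, 16, 64) gate-fused predictions
```

Sharing one transformer across all horizons, rather than instantiating one per horizon, is what would keep the added cost close to the single linear gate, consistent with the abstract's claim of minimal training and inference overhead.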