RoboOmni: Actions Are Just Another Modality for Your Vision-Language Models
Dong Wang ⋅ Zilong Chen ⋅ Jirong Liu ⋅ Ziqing Qiao ⋅ Xin Xiao ⋅ Bingyi Kang ⋅ Hongtao Wu ⋅ Xiao Ma ⋅ Tao Kong ⋅ Huaping Liu
Abstract
Integrating Vision-Language Models (VLMs) into robotics has facilitated the development of generalizable Vision-Language-Action (VLA) policies. However, unified discrete frameworks lag behind decoupled continuous designs due to limitations in action chunking and temporal modeling. To address this, we introduce **RoboOmni**, a unified multi-modal next-token prediction framework. Challenging the assumption that continuous modeling is essential for high-performance manipulation, **RoboOmni** demonstrates that *actions are just another modality* capable of being effectively modeled discretely. At the core of our method is Multi-Token Action Prediction (MTAP), which integrates action chunking directly into the discrete tokenizer. This design resolves temporal modeling bottlenecks and significantly reduces the distribution shift between training and inference. By preserving the native VLM training and inference pipeline, **RoboOmni** naturally benefits from large-scale multimodal co-training and modern decoding optimizations. Extensive evaluations on CALVIN, SimplerEnv, and real-world platforms confirm that **RoboOmni** establishes new state-of-the-art performance, significantly outperforming diffusion-based baselines such as $\pi_0$. Notably, combining our proposed MTAP with the FAST tokenizer achieves a 94.4% average success rate on CALVIN, while the Bin tokenizer implementation attains a 27$\times$ inference speedup compared to OpenVLA.
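The abstract's central claim is that an action chunk can be serialized into ordinary discrete tokens and emitted by a VLM via next-token prediction. The sketch below is only an illustration of that idea under assumed design choices; the class name `BinActionTokenizer`, the bin count, value range, chunk length, and vocabulary offset are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


class BinActionTokenizer:
    """Illustrative uniform-bin action tokenizer (assumed design, not the paper's code).

    Each action dimension is clipped to [low, high] and quantized into n_bins bins.
    A chunk of H future actions is flattened into H * action_dim token ids, so the
    whole chunk can be produced by standard autoregressive next-token prediction.
    """

    def __init__(self, action_dim=7, n_bins=256, low=-1.0, high=1.0, vocab_offset=32000):
        self.action_dim = action_dim
        self.n_bins = n_bins
        self.low = low
        self.high = high
        self.vocab_offset = vocab_offset  # assumed start of action tokens in the VLM vocab

    def encode(self, action_chunk):
        """action_chunk: (H, action_dim) array -> list of H * action_dim token ids."""
        a = np.clip(np.asarray(action_chunk, dtype=np.float64), self.low, self.high)
        # Map [low, high] onto integer bin indices in [0, n_bins - 1].
        bins = np.round((a - self.low) / (self.high - self.low) * (self.n_bins - 1)).astype(int)
        return (bins + self.vocab_offset).reshape(-1).tolist()

    def decode(self, token_ids):
        """Inverse of encode: token ids -> (H, action_dim) array of bin centers."""
        bins = np.asarray(token_ids).reshape(-1, self.action_dim) - self.vocab_offset
        return self.low + bins / (self.n_bins - 1) * (self.high - self.low)


if __name__ == "__main__":
    tok = BinActionTokenizer()
    chunk = np.random.uniform(-1.0, 1.0, size=(8, 7))  # hypothetical 8-step action chunk
    ids = tok.encode(chunk)                             # 56 discrete tokens
    recon = tok.decode(ids)
    print(len(ids), float(np.abs(chunk - recon).max()))  # small quantization error
```

In this reading, the predicted token sequence for a chunk is simply appended to the VLM's output stream, which is why the usual training and decoding machinery (co-training, fast decoding) carries over unchanged; how MTAP orders or groups those tokens is specified in the paper itself, not in this sketch.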