RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization
Songming Liu ⋅ Bangguo Li ⋅ Kai Ma ⋅ Lingxuan Wu ⋅ Hengkai Tan ⋅ Xiao Ouyang ⋅ Hang Su ⋅ Jun Zhu
Abstract
Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and an inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B-parameter vision-language model (VLM) and designed for zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets (over 10,000 hours of demonstrations spanning diverse task families) using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow matching, and distillation for real-time inference. As a result, RDT2 is among the first models that simultaneously generalize zero-shot to unseen objects, scenes, instructions, and even robotic platforms. Moreover, it outperforms state-of-the-art baselines on dexterous, long-horizon, and dynamic downstream tasks such as playing table tennis.
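The abstract does not specify how the action tokenizer is built; the sketch below only illustrates the general Residual Vector Quantization idea it references, i.e., quantizing a continuous action vector in stages so that each stage encodes the residual error left by the previous one, yielding a short sequence of discrete codes. The number of stages, codebook size, and action dimensionality here are placeholder assumptions, not values from the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual Vector Quantization: quantize x stage by stage,
    each stage quantizing the residual left by the previous one."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                       # cb: (K, D) codebook matrix
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest code in this stage
        indices.append(idx)
        residual = residual - cb[idx]          # pass the residual onward
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vectors across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy usage (illustrative sizes): 3 stages, 8 codes each, 4-dim actions.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
action = rng.normal(size=4)
codes = rvq_encode(action, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(action - approx))
```

In a recipe like the one described, such discrete codes would presumably let the VLM backbone handle actions with its native token interface, while the later flow-matching and distillation stages (not shown) recover fast, continuous control.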