RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Abstract
Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two key challenges: (1) the lack of an autonomous self-correction mechanism for resolving execution failures in complex coordination tasks, and (2) the scarcity of the diverse visual and spatial variations required to bridge the sim-to-real gap. To address these challenges, we present RoboTwin 2.0, a scalable simulation framework that enables automated, closed-loop, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. Built upon RoboTwin-OD, a foundational object library of 731 instances across 147 categories with rich semantic annotations, our framework integrates Multimodal Large Language Models (MLLMs) with simulation-in-the-loop verification. This integration forms an automated feedback mechanism that significantly boosts the success rate of expert task-program generation. To facilitate robust sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language instructions, thereby substantially broadening data diversity. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments. Empirical evaluations demonstrate that Vision-Language-Action (VLA) models pre-trained on our synthetic data achieve a 3.6x improvement in few-shot real-world transfer (over a 10-demo baseline) and a 2.2x gain in zero-shot generalization. We release the data generator, benchmark, pre-collected dataset, and code to support scalable research in robust bimanual manipulation.
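To make the pipeline described above concrete, the sketch below illustrates one way the closed-loop generation and five-axis domain randomization could fit together: a scene is sampled per axis, an MLLM proposes an expert task program, and simulated execution verifies it, feeding failure reports back for self-correction. This is a minimal sketch under assumed interfaces; all names here (SceneConfig, sample_scene, mllm.propose_program, sim.execute) are hypothetical illustrations, not the released RoboTwin 2.0 API.

```python
# Minimal sketch of the closed-loop generation pipeline described in the
# abstract. All names (SceneConfig, sample_scene, mllm.propose_program,
# sim.execute) are hypothetical, not the released RoboTwin 2.0 API.
import random
from dataclasses import dataclass

@dataclass
class SceneConfig:
    """One randomized scene along the five axes named in the abstract."""
    clutter_objects: int           # number of distractor objects on the table
    lighting_temperature_k: float  # light color temperature in Kelvin
    background_id: int             # index into a pool of background textures
    tabletop_height_m: float       # table surface height in meters
    instruction: str               # natural-language task instruction

def sample_scene(instruction_templates: list[str]) -> SceneConfig:
    """Structured domain randomization: draw one value per axis."""
    return SceneConfig(
        clutter_objects=random.randint(0, 8),
        lighting_temperature_k=random.uniform(3000.0, 7500.0),
        background_id=random.randrange(20),
        tabletop_height_m=random.uniform(0.70, 0.90),
        instruction=random.choice(instruction_templates),
    )

def generate_episodes(task, mllm, sim, n_episodes=100, max_attempts=1000):
    """The MLLM proposes an expert task program; executing it in simulation
    verifies the program, and failure reports are fed back to the MLLM so
    it can self-correct on the next attempt."""
    episodes, feedback = [], None
    for _ in range(max_attempts):
        if len(episodes) >= n_episodes:
            break
        scene = sample_scene(task.instruction_templates)
        program = mllm.propose_program(task, scene, feedback)
        result = sim.execute(program, scene)    # simulation-in-the-loop check
        if result.success:
            episodes.append(result.trajectory)
            feedback = None                     # reset after a success
        else:
            feedback = result.error_report      # close the loop on failure
    return episodes
```

In the full system, the execution step would roll out the generated program on one of the five supported embodiments and score task success, so that only verified trajectories enter the dataset.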