A Mechanistic Understanding of Sim-and-Real Co-Training in Generative Policies
Abstract
Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment demonstrations, has been widely adopted for training generative visuomotor robot policies. Despite its empirical success, the mechanisms underlying when and why co-training works remain poorly understood. Starting from theoretical analysis and a toy example, we identify two key intrinsic factors for end-to-end co-training systems: a "balanced mixing ratio" and "structured representation alignment". We propose the explanation that when simulation and real-world data are combined with a balanced mixing ratio, co-training naturally learns representations that are aligned across domains while remaining domain-distinguishable, enabling effective knowledge transfer without sacrificing real-world adaptation; we refer to this property as structured representation alignment. We validate this hypothesis with comprehensive sim-and-sim and sim-and-real robotic experiments, showing that structured representation alignment reliably emerges under balanced mixing ratios and largely determines downstream performance. Benchmarking several recent co-training methods further supports this explanation. Guided by our analysis, we propose a simple combination of co-training techniques that jointly promote alignment and domain discernibility, achieving substantial improvements over prior approaches.