SafeLab: An Interactive High-Fidelity Benchmark for Embodied Safety in Scientific Robotics
Abstract
Laboratory automation driven by scientific embodied agents represents a critical frontier for modern science. Unlike conventional robotic domains, laboratory environments impose zero-tolerance constraints on manipulation precision and collision avoidance, as minor deviations can lead to irreversible chemical hazards or equipment damage. This makes the automated laboratory a natural testbed for advancing embodied safety. However, existing benchmarks predominantly feature high-tolerance manipulation tasks in which intermediate failures are largely reversible. More critically, current Vision-Language-Action (VLA) models trained via static imitation learning cannot satisfy these strict constraints: because they merely mimic successful demonstrations, they cannot recover from execution drift, leading to catastrophic compounding errors in precision-critical domains. Overcoming this limitation requires moving from static datasets to interactive environments that support Reinforcement Learning (RL) for dynamic error recovery. To this end, we introduce SafeLab, a generative simulation benchmark designed for the full lifecycle of safe robot learning. Grounded in a high-fidelity chemistry lab, our framework integrates an LLM engine for procedural task synthesis, an automated expert for scalable demonstration collection, and an interactive environment for continuous RL refinement. Leveraging this infrastructure, we release a dataset of 6,000+ complex trajectories and use it to evaluate state-of-the-art VLA models. Experiments reveal that current embodied agents fail frequently under these safety constraints. In contrast, our RL post-training pipeline enables agents to learn active error correction, mitigating hazardous failures and improving success rates by 37\%, thereby establishing SafeLab as a critical platform for developing reliable and safe generalist agents.