Real-world Reinforcement Learning from Suboptimal Interventions
Abstract
Real-world reinforcement learning (RL) offers a promising approach to training robotic manipulation policies through online interaction. While recent methods leverage human interventions to accelerate learning, they often assume that interventions are consistently optimal, or they rely on offline filtering mechanisms that may discard valuable exploratory data. In practice, the quality of human interventions varies across states: operators may provide near-optimal guidance in familiar situations but struggle in novel or ambiguous ones. The key challenge is how to selectively leverage interventions of heterogeneous quality across states while maintaining the benefits of online exploration. To address this, we propose SiLRI, a state-wise Lagrangian RL algorithm that adaptively trades off between imitating interventions and maximizing future returns. We formulate online learning as a constrained optimization problem in which the constraint bounds vary across states according to estimated intervention uncertainty. This problem is then solved via state-wise Lagrangian relaxation, enabling the policy to selectively imitate interventions in high-confidence regions while relying more on RL exploration elsewhere. We evaluate SiLRI on nine real-world manipulation tasks using a human-as-copilot teleoperation system. Compared to HIL-SERL, which treats all interventions equally, SiLRI achieves at least 50% faster learning, effectively exploiting suboptimal human interventions without being constrained by them.
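To make the idea of state-wise Lagrangian relaxation concrete, the following is a minimal illustrative sketch and not the SiLRI implementation. It assumes a generic constrained problem of the form: maximize Q(s, pi(s)) subject to gap(s) <= eps(s) in every state s, where gap(s) is some measure of how far the policy deviates from the human intervention and eps(s) is a bound that loosens as estimated intervention uncertainty grows. The function names, the linear bound `scale * uncertainty`, and the projected dual-ascent update are all assumptions introduced here for illustration.

```python
def state_wise_bounds(uncertainty, scale=0.5):
    """Hypothetical mapping from intervention uncertainty to a per-state bound eps(s).

    Assumption: higher uncertainty gives a looser bound, so the policy is
    allowed to deviate more from the intervention in that state.
    """
    return [scale * u for u in uncertainty]


def dual_ascent_step(multipliers, imitation_gap, bounds, lr=0.1):
    """One projected gradient-ascent step on the per-state multipliers lambda(s) >= 0.

    lambda(s) grows where the constraint gap(s) <= eps(s) is violated and
    decays toward zero where it is satisfied.
    """
    return [
        max(0.0, lam + lr * (gap - eps))
        for lam, gap, eps in zip(multipliers, imitation_gap, bounds)
    ]


def lagrangian_loss(q_values, imitation_gap, multipliers, bounds):
    """Mean over states of -Q(s, pi(s)) + lambda(s) * (gap(s) - eps(s))."""
    terms = [
        -q + lam * (gap - eps)
        for q, gap, lam, eps in zip(q_values, imitation_gap, multipliers, bounds)
    ]
    return sum(terms) / len(terms)


# Toy example with two states: the operator is confident in state 0
# (low uncertainty) and unsure in state 1 (high uncertainty).
uncertainty = [0.1, 0.9]
bounds = state_wise_bounds(uncertainty)   # tight bound in state 0, loose in state 1
imitation_gap = [0.3, 0.3]                # same deviation from the human in both states
multipliers = [0.0, 0.0]

for _ in range(10):
    multipliers = dual_ascent_step(multipliers, imitation_gap, bounds)

loss = lagrangian_loss([1.0, 1.0], imitation_gap, multipliers, bounds)
```

In this toy setting the multiplier for the confident state becomes positive, so the imitation term contributes to the loss there, while the multiplier for the uncertain state stays at zero and the loss reduces to the return term alone. In a real system the multipliers would more plausibly be produced by a learned function of the state rather than stored per state, which is a design choice this sketch does not address.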