
Mind the Gap: Safely Bridging Offline and Online Reinforcement Learning
Wanqiao Xu · Kan Xu · Hamsa Bastani · Osbert Bastani

A key challenge in deploying reinforcement learning in practice is exploring safely. We propose a natural safety property---uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. This property formalizes the idea that we should spread out exploration to avoid taking actions significantly worse than the ones that are currently known to be good. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. To ensure exploration across the entire state space, it adaptively determines when to explore (at different points in time across different episodes) in a way that allows "stitching" sub-episodes together to obtain a meta-episode that is equivalent to using UCB for the entire episode. We then establish reasonable assumptions about the underlying MDP under which our algorithm is guaranteed to achieve sublinear regret while ensuring safety; under these assumptions, the cost of imposing safety is only a constant factor.
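The core override mechanism described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the way the exploration cost is estimated, and the budget bookkeeping are all simplifying assumptions for illustration. The idea shown is only that the agent takes the optimistic UCB action when its estimated shortfall relative to the conservative baseline action fits within the remaining per-episode budget, and falls back to the baseline action otherwise.

```python
import numpy as np

def choose_action(q_ucb, q_baseline, budget_remaining):
    """Hypothetical per-step safety override (illustrative only).

    q_ucb:            optimistic (UCB) value estimates per action
    q_baseline:       value estimates under the conservative baseline policy
    budget_remaining: per-episode exploration budget not yet spent

    Returns the chosen action and the updated remaining budget.
    """
    a_ucb = int(np.argmax(q_ucb))        # exploratory (optimistic) action
    a_safe = int(np.argmax(q_baseline))  # conservative baseline action
    # Estimated shortfall of the exploratory action vs. the safe action,
    # measured under the baseline value estimates.
    cost = max(0.0, q_baseline[a_safe] - q_baseline[a_ucb])
    if cost <= budget_remaining:
        # Exploring stays within budget: take the UCB action and spend.
        return a_ucb, budget_remaining - cost
    # Exploring would exceed the budget: override with the safe action.
    return a_safe, budget_remaining
```

For example, if the UCB action looks much worse than the baseline action under the baseline's own estimates, a small remaining budget forces the override; a larger budget permits the exploratory step and is charged the estimated shortfall.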

Author Information

Wanqiao Xu (University of Michigan)
Kan Xu (University of Pennsylvania)
Hamsa Bastani (Wharton)
Osbert Bastani (University of Pennsylvania)
