Constrained Offline Policy Optimization

Nicholas Polosky · Bruno C. da Silva · Madalina Fiterau · Jithin Jagannath

Hall E #910

Keywords: [ RL: Policy Search ] [ RL: Batch/Offline ]

Poster
Wed 20 Jul 3:30 p.m. PDT — 5:30 p.m. PDT
Spotlight presentation: Reinforcement Learning
Wed 20 Jul 1:30 p.m. PDT — 3:05 p.m. PDT


In this work we introduce Constrained Offline Policy Optimization (COPO), an offline policy optimization algorithm for learning in MDPs with cost constraints. COPO is built upon a novel offline cost-projection method, which we formally derive and analyze. Our method improves upon the state of the art in offline constrained policy optimization by explicitly accounting for distributional shift and by offering non-asymptotic confidence bounds on the cost of a policy. These formal properties are stronger than those of existing techniques, which only guarantee convergence to a point estimate. We empirically demonstrate that COPO achieves state-of-the-art performance on discrete and continuous control problems while offering these stronger and more robust theoretical guarantees.
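To illustrate why a non-asymptotic confidence bound on a policy's cost is a stronger guarantee than a point estimate, the sketch below computes a generic Hoeffding upper bound on expected cost from a finite batch of data. This is not COPO's cost-projection method. The function names `hoeffding_cost_upper_bound` and `satisfies_constraint` are hypothetical, and the sketch assumes that the per-episode cost estimates are i.i.d., unbiased for the policy's expected cost, and bounded in `[0, cost_range]` (for instance, importance-weighted episode costs whose weights are known to be bounded).

```python
import math


def hoeffding_cost_upper_bound(cost_estimates, cost_range, delta):
    """Upper bound on a policy's expected cost that holds with probability >= 1 - delta.

    Assumption (illustrative only): the estimates are i.i.d., unbiased for the
    expected cost, and each lies in [0, cost_range]. Hoeffding's inequality then
    gives a bound that is valid for every finite sample size n, not only as n -> infinity.
    """
    n = len(cost_estimates)
    if n == 0:
        raise ValueError("need at least one cost estimate")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    if any(c < 0.0 or c > cost_range for c in cost_estimates):
        raise ValueError("every estimate must lie in [0, cost_range]")

    point_estimate = sum(cost_estimates) / n
    # Hoeffding width: cost_range * sqrt(ln(1/delta) / (2n)).
    width = cost_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return point_estimate + width


def satisfies_constraint(cost_estimates, cost_range, delta, threshold):
    # Accept the policy only if the high-confidence bound, rather than the
    # sample mean alone, stays within the cost limit.
    return hoeffding_cost_upper_bound(cost_estimates, cost_range, delta) <= threshold
```

For example, with 100 estimates averaging 0.25, `cost_range=1.0`, and `delta=0.05`, the bound is roughly 0.372. A cost limit of 0.3 would be accepted by the point estimate alone but rejected by the bound; the bound tightens as the number of episodes grows.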
