Skip to yearly menu bar Skip to main content


Best of Both Worlds Policy Optimization

Christoph Dann · Chen-Yu Wei · Julian Zimmert

Ballroom C

Abstract: Policy optimization methods are popular reinforcement learning algorithms in practice and recent works have build theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that by carefully designing the regularizer, bonus terms, and learning rates, one can achieve a more favorable $\text{polylog}(T)$ regret bound when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. Specifically, we show the first best of both worlds guarantee for policy optimization in tabular MDPs by leveraging either a Tsallis entropy or a Shannon entropy regularizer. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log barrier regularizer.

Chat is not available.