
Robust Policy Gradient against Strong Data Corruption
Xuezhou Zhang · Yiding Chen · Jerry Zhu · Wen Sun

Wed Jul 21 09:00 PM -- 11:00 PM (PDT)
We study the problem of robust reinforcement learning under adversarial corruption of both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most an $\epsilon$-fraction of the learning episodes. This attack model is strictly stronger than those considered in prior work. Our first result shows that no algorithm can find a better than $O(\epsilon)$-optimal policy under our attack model. Next, we show that, surprisingly, the natural policy gradient (NPG) method retains a natural robustness property when the reward corruption is bounded, and can find an $O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the first algorithm to achieve a meaningful learning guarantee when a constant fraction of episodes is corrupted. Complementary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.
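The abstract does not spell out FPG's filtering rule, so the following is only a minimal, illustrative Python sketch of the two ingredients it describes: an adversary that arbitrarily corrupts rewards in an $\epsilon$-fraction of episodes, and a policy-gradient update that filters out episodes with outlying returns before averaging. The toy one-step MDP, the median-based filter, and names such as filtered_pg_step are assumptions made here for illustration, not the paper's algorithm.

# Illustrative sketch (NOT the paper's FPG): episode-level reward corruption
# plus a robust filter before a REINFORCE-style gradient update.
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, corrupt=False):
    """One episode of a toy one-step MDP; when corrupt=True the adversary
    replaces the reward with an arbitrarily large spurious value."""
    action = theta + rng.normal()        # Gaussian policy with mean theta
    reward = -(action - 2.0) ** 2        # true reward peaks at action = 2
    if corrupt:
        reward = 1e6 * rng.normal()      # unbounded reward corruption
    return action, reward

def filtered_pg_step(theta, eps=0.1, n=200, lr=1e-3):
    """Collect n episodes, an eps-fraction corrupted; drop episodes whose
    reward is far from the empirical median (an illustrative robust filter),
    then take a policy-gradient step on the survivors."""
    corrupt_mask = rng.random(n) < eps
    data = [rollout(theta, c) for c in corrupt_mask]
    rewards = np.array([r for _, r in data])
    med = np.median(rewards)
    mad = np.median(np.abs(rewards - med)) + 1e-8
    keep = np.abs(rewards - med) <= 5.0 * mad   # assumed threshold
    # REINFORCE estimator: grad log pi(a|theta) = (a - theta) for unit-variance
    # Gaussian policies, averaged over the episodes that survive the filter.
    grad = np.mean([(a - theta) * r for (a, r), k in zip(data, keep) if k])
    return theta + lr * grad

theta = 0.0
for _ in range(2000):
    theta = filtered_pg_step(theta)
print(f"learned policy mean: {theta:.2f} (optimum is 2.00)")

Because the uncorrupted episodes form the majority, a median-based filter discards the outliers no matter how large they are, which loosely mirrors why a filtering step can tolerate unbounded reward corruption while plain averaging cannot.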

Author Information

Xuezhou Zhang (University of Wisconsin-Madison)
Yiding Chen (University of Wisconsin-Madison)
Jerry Zhu (University of Wisconsin-Madison)
Wen Sun (Cornell University)
