
Preventing Reward Hacking with Occupancy Measure Regularization
Cassidy Laidlaw · Shivam Singhal · Anca Dragan
Event URL: https://openreview.net/forum?id=oiT8js6p3Z

Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent’s state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment.
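The contrast the abstract draws between action-distribution and occupancy-measure regularization can be illustrated with a toy sketch. This is not the ORPO algorithm and not the authors' code. The two-state "road/ditch" MDP, the function names, the finite-horizon averaged definition of the occupancy measure, and the use of total variation distance are all assumptions made here for illustration. The sketch compares a safe policy with one that swerves off the road 5% of the time.

```python
from collections import defaultdict

# Hypothetical toy MDP (an assumption for illustration, not from the paper):
# the ditch is absorbing, so a single "swerve" changes the rest of the trajectory.
TRANSITIONS = {
    ("road", "stay"): "road",
    ("road", "swerve"): "ditch",
    ("ditch", "stay"): "ditch",
    ("ditch", "swerve"): "ditch",
}

def occupancy_measure(policy, transitions, start_state, horizon):
    """Finite-horizon occupancy: average over t of P(s_t = s, a_t = a)."""
    state_dist = {start_state: 1.0}
    occupancy = defaultdict(float)
    for _ in range(horizon):
        next_dist = defaultdict(float)
        for state, p_state in state_dist.items():
            for action, p_action in policy[state].items():
                if p_action == 0.0:
                    continue
                joint = p_state * p_action
                occupancy[(state, action)] += joint / horizon
                next_dist[transitions[(state, action)]] += joint
        state_dist = dict(next_dist)
    return dict(occupancy)

def total_variation(p, q):
    """Total variation distance between two distributions stored as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

safe_policy = {
    "road": {"stay": 1.0, "swerve": 0.0},
    "ditch": {"stay": 1.0, "swerve": 0.0},
}
# Differs from the safe policy only slightly, and only in one state.
learned_policy = {
    "road": {"stay": 0.95, "swerve": 0.05},
    "ditch": {"stay": 1.0, "swerve": 0.0},
}

HORIZON = 100

# Action-distribution view: largest per-state distance between the policies.
action_tv = max(
    total_variation(safe_policy[s], learned_policy[s]) for s in safe_policy
)

# Occupancy view: distance between the state-action visitation distributions.
safe_occ = occupancy_measure(safe_policy, TRANSITIONS, "road", HORIZON)
learned_occ = occupancy_measure(learned_policy, TRANSITIONS, "road", HORIZON)
occupancy_tv = total_variation(safe_occ, learned_occ)

print(f"action-distribution TV: {action_tv:.3f}")
print(f"occupancy-measure TV:   {occupancy_tv:.3f}")
```

In this toy setting the two policies look nearly identical at the action level (a distance of 0.05), yet their occupancy measures are far apart (roughly 0.81), because the rare swerve sends the agent into a state it never leaves. A regularizer based on action distributions would treat the learned policy as close to the safe one, while one based on occupancy measures would not.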

Author Information

Cassidy Laidlaw (University of California, Berkeley)

I’m a third-year PhD student studying computer science at the University of California, Berkeley. I’m interested in human-AI cooperation, reinforcement learning theory, and robustness and uncertainty in machine learning. I received my BS in computer science and mathematics from the University of Maryland in 2018. My PhD is currently funded by a National Defense Science and Engineering Graduate (NDSEG) Fellowship and I am also the recipient of an Open Phil AI Fellowship.

Shivam Singhal (University of California, Berkeley)
Anca Dragan (University of California, Berkeley)

More from the Same Authors