Timezone: »
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent’s state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment.
Author Information
Cassidy Laidlaw (University of California Berkeley)

I’m a third-year PhD student studying computer science at the University of California, Berkeley. I’m interested in human-AI cooperation, reinforcement learning theory, and robustness and uncertainty in machine learning. I received my BS in computer science and mathematics from the University of Maryland in 2018. My PhD is currently funded by a National Defense Science and Engineering Graduate (NDSEG) Fellowship and I am also the recipient of an Open Phil AI Fellowship.
Shivam Singhal (University of California, Berkeley)
Anca Dragan (University of California, Berkeley)
Related Events (a corresponding poster, oral, or spotlight)
-
2023 : Preventing Reward Hacking with Occupancy Measure Regularization »
Dates n/a. Room
More from the Same Authors
-
2022 : A Study of Causal Confusion in Preference-Based Reward Learning »
Jeremy Tien · Zhiyang He · Zackory Erickson · Anca Dragan · Daniel S Brown -
2023 : Preventing Reward Hacking with Occupancy Measure Regularization »
Cassidy Laidlaw · Shivam Singhal · Anca Dragan -
2023 : Video-Guided Skill Discovery »
Manan Tomar · Dibya Ghosh · Vivek Myers · Anca Dragan · Matthew Taylor · Philip Bachman · Sergey Levine -
2023 Workshop: Interactive Learning with Implicit Human Feedback »
Andi Peng · Akanksha Saran · Andreea Bobu · Tengyang Xie · Pierre-Yves Oudeyer · Anca Dragan · John Langford -
2023 : Bridging RL Theory and Practice with the Effective Horizon »
Cassidy Laidlaw · Stuart Russell · Anca Dragan -
2023 : Learning Optimal Advantage from Preferences and Mistaking it for Reward »
William Knox · Stephane Hatgis-Kessell · Sigurdur Adalgeirsson · Serena Booth · Anca Dragan · Peter Stone · Scott Niekum -
2023 Poster: Contextual Reliability: When Different Features Matter in Different Contexts »
Gaurav Ghosal · Amrith Setlur · Daniel S Brown · Anca Dragan · Aditi Raghunathan -
2023 Poster: Automatically Auditing Large Language Models via Discrete Optimization »
Erik Jones · Anca Dragan · Aditi Raghunathan · Jacob Steinhardt -
2022 Poster: Estimating and Penalizing Induced Preference Shifts in Recommender Systems »
Micah Carroll · Anca Dragan · Stuart Russell · Dylan Hadfield-Menell -
2022 Spotlight: Estimating and Penalizing Induced Preference Shifts in Recommender Systems »
Micah Carroll · Anca Dragan · Stuart Russell · Dylan Hadfield-Menell -
2022 : Learning to interact: PARTIAL OBSERVABILITY + GAME Theory of mind on steroids »
Anca Dragan -
2022 : Learning to interact: PARTIAL OBSERVABILITY The actions you take as part of the task are the queries! »
Anca Dragan -
2022 : Q&A »
Dorsa Sadigh · Anca Dragan -
2022 Tutorial: Learning for Interactive Agents »
Dorsa Sadigh · Anca Dragan -
2022 : Learning objectives and preferences: WHAT DATA? From diverse types of human data »
Anca Dragan -
2021 Poster: Policy Gradient Bayesian Robust Optimization for Imitation Learning »
Zaynah Javed · Daniel Brown · Satvik Sharma · Jerry Zhu · Ashwin Balakrishna · Marek Petrik · Anca Dragan · Ken Goldberg -
2021 Spotlight: Policy Gradient Bayesian Robust Optimization for Imitation Learning »
Zaynah Javed · Daniel Brown · Satvik Sharma · Jerry Zhu · Ashwin Balakrishna · Marek Petrik · Anca Dragan · Ken Goldberg -
2021 Poster: Value Alignment Verification »
Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum -
2021 Spotlight: Value Alignment Verification »
Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum -
2020 : Invited Talk 7: Prof. Anca Dragan from UC Berkeley »
Anca Dragan -
2020 : "Active Learning through Physically-embodied, Synthesized-from-“scratch” Queries" »
Anca Dragan -
2020 Poster: Learning Human Objectives by Evaluating Hypothetical Behavior »
Siddharth Reddy · Anca Dragan · Sergey Levine · Shane Legg · Jan Leike -
2019 Poster: On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference »
Rohin Shah · Noah Gundotra · Pieter Abbeel · Anca Dragan -
2019 Oral: On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference »
Rohin Shah · Noah Gundotra · Pieter Abbeel · Anca Dragan -
2019 Poster: Learning a Prior over Intent via Meta-Inverse Reinforcement Learning »
Kelvin Xu · Ellis Ratner · Anca Dragan · Sergey Levine · Chelsea Finn -
2019 Oral: Learning a Prior over Intent via Meta-Inverse Reinforcement Learning »
Kelvin Xu · Ellis Ratner · Anca Dragan · Sergey Levine · Chelsea Finn -
2018 Poster: An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning »
Dhruv Malik · Malayandi Palaniappan · Jaime Fisac · Dylan Hadfield-Menell · Stuart Russell · Anca Dragan -
2018 Oral: An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning »
Dhruv Malik · Malayandi Palaniappan · Jaime Fisac · Dylan Hadfield-Menell · Stuart Russell · Anca Dragan