Defining and Characterizing Reward Gaming
Joar Skalse · Nikolaus Howe · Dmitrii Krasheninnikov · David Krueger
We provide the first formal definition of \textbf{reward gaming}, a phenomenon where optimizing an imperfect \emph{proxy reward function}, $\mathcal{\tilde{R}}$, leads to poor performance according to a true reward function, $\mathcal{R}$. We say that a proxy is \emph{ungameable} if increasing the expected proxy return can never decrease the expected true return. Intuitively, it should be possible to create an ungameable proxy by overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (as a function of state-action visit counts) makes ungameability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be ungameable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial ungameable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of ungameability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
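For a finite set of policies, the ungameability condition stated in the abstract can be checked directly. The following is a minimal sketch, assuming the condition is read as "no pair of policies exists where the proxy return strictly increases while the true return strictly decreases". The function name `is_gameable` and all return values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from itertools import permutations


def is_gameable(proxy_returns, true_returns):
    """Check a finite policy set for gaming.

    proxy_returns[i] and true_returns[i] are the expected returns of
    policy i under the proxy and the true reward (illustrative inputs).
    Returns True if some ordered pair of policies (a, b) has a strictly
    higher proxy return for b but a strictly lower true return for b.
    """
    pairs = list(zip(proxy_returns, true_returns))
    return any(
        p_a < p_b and t_a > t_b
        for (p_a, t_a), (p_b, t_b) in permutations(pairs, 2)
    )


# Hypothetical expected true returns for three policies.
true_returns = [1.0, 2.0, 3.0]

# A coarser proxy that ignores the difference between policies 1 and 2.
coarse_proxy = [0.0, 1.0, 1.0]

# A proxy that ranks policy 1 above policy 2, against the true ordering.
bad_proxy = [0.0, 2.0, 1.0]

print(is_gameable(coarse_proxy, true_returns))  # False
print(is_gameable(bad_proxy, true_returns))     # True
```

The coarse proxy is an instance of the non-trivial ungameable pairs that the abstract says exist for finite policy sets: it merges two policies into a tie but never reverses the true ordering.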

Author Information

Joar Skalse (University of Oxford)
Nikolaus Howe (Mila, University of Montreal)
Dmitrii Krasheninnikov (University of Cambridge)

I am a Research Engineer working on deep RL and robotics at Sony AI. This fall I’ll be starting a PhD at the University of Cambridge, where I’ll work on technical AI safety with Prof. David Krueger.

David Krueger (Mila, University of Montreal)

More from the Same Authors