A Study of Causal Confusion in Preference-Based Reward Learning
Jeremy Tien · Zhiyang He · Zackory Erickson · Anca Dragan · Daniel S Brown
Event URL: https://openreview.net/forum?id=WaZZ0Sw9fWf

There is growing anecdotal evidence that learning reward functions from preferences is prone to spurious correlations, leading to reward hacking behaviors. While there is much empirical and theoretical analysis of causal confusion and reward gaming behaviors in reinforcement learning and behavioral cloning approaches, we provide the first systematic study of causal confusion in the context of learning reward functions from preferences. We identify a set of three benchmark domains where we observe causal confusion when learning reward functions from offline datasets of pairwise trajectory preferences: a simple reacher domain, an assistive feeding domain, and an itch-scratching domain. To gain insight into this observed causal confusion, we perform a sensitivity analysis of how two factors, reward model capacity and feature dimensionality, affect the robustness of rewards learned from preferences. We find evidence that learning rewards from preferences is highly sensitive and non-robust to spurious features and increasing model capacity.

Author Information

Jeremy Tien (University of California, Berkeley)
Zhiyang He (University of California, Berkeley)
Zackory Erickson (Carnegie Mellon University)
Anca Dragan (University of California, Berkeley)
Daniel S Brown (University of Texas at Austin)
