Poster in Workshop: Spurious correlations, Invariance, and Stability (SCIS)

A Study of Causal Confusion in Preference-Based Reward Learning

Jeremy Tien · Zhiyang He · Zackory Erickson · Anca Dragan · Daniel S Brown

Keywords: [ causal reward confusion ] [ preference learning ] [ reward learning ]


Abstract:

Anecdotal evidence has recently accumulated that learning reward functions from preferences is prone to spurious correlations, leading to reward-hacking behaviors. While there is much empirical and theoretical analysis of causal confusion and reward gaming in reinforcement learning and behavioral cloning, we provide the first systematic study of causal confusion in the context of learning reward functions from preferences. We identify three benchmark domains in which we observe causal confusion when learning reward functions from offline datasets of pairwise trajectory preferences: a simple reacher domain, an assistive feeding domain, and an itch-scratching domain. To gain insight into this observed causal confusion, we perform a sensitivity analysis of how two factors, reward model capacity and feature dimensionality, affect the robustness of rewards learned from preferences. We find evidence that reward learning from preferences is highly non-robust: it is sensitive to spurious features, and this sensitivity grows with increasing model capacity.
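For orientation, the following is a minimal sketch of the Bradley-Terry style objective that is standard in preference-based reward learning: a reward network scores each trajectory, and the preference label is fit with a cross-entropy loss on the difference of trajectory returns. This is an illustrative assumption about the general setup, not the paper's implementation; the architecture, the names RewardNet and preference_loss, and all dimensions are hypothetical.

```python
# Minimal sketch: learning a reward model from pairwise trajectory
# preferences with a Bradley-Terry cross-entropy loss (assumed setup).
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps per-timestep feature vectors to scalar rewards (illustrative)."""
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, horizon, feature_dim)
        # Per-step rewards are summed over the horizon to give the
        # total return of each trajectory: shape (batch,).
        return self.net(features).squeeze(-1).sum(dim=-1)

def preference_loss(reward_net, traj_a, traj_b, prefs):
    """Bradley-Terry loss: prefs[i] = 1.0 if trajectory A is preferred."""
    r_a = reward_net(traj_a)
    r_b = reward_net(traj_b)
    # P(A preferred over B) = sigmoid(r_a - r_b) under Bradley-Terry.
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, prefs)

# Toy usage with random data (dimensions are arbitrary assumptions).
feature_dim, horizon, batch = 8, 50, 32
net = RewardNet(feature_dim)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
traj_a = torch.randn(batch, horizon, feature_dim)
traj_b = torch.randn(batch, horizon, feature_dim)
prefs = torch.randint(0, 2, (batch,)).float()
loss = preference_loss(net, traj_a, traj_b, prefs)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the loss depends only on return differences, any feature correlated with the labels in the offline dataset can be exploited by the reward model, which is one way the causal confusion studied here can arise.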
