Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
Daniel Brown · Wonjoon Goo · Prabhat Nagarajan · Scott Niekum

Wed Jun 03:00 PM -- 03:05 PM PDT @ Hall B

A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL algorithms generally rely on some form of mimicry, such as feature-count matching, rather than inferring the demonstrator's underlying intentions, which may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from potentially poor demonstrations. We show that this approach, when combined with deep reinforcement learning, can achieve performance that is more than an order of magnitude better than the best-performing demonstration, as well as a state-of-the-art behavioral cloning from observation method, on multiple Atari and MuJoCo benchmark tasks. Finally, we demonstrate that T-REX is robust to modest amounts of ranking noise, opening up future possibilities for automating the ranking process, for example by watching a learner noisily improve at a task over time.
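
As a rough illustration of how a reward function might be inferred from ranked demonstrations, the sketch below fits a reward so that higher-ranked trajectories receive higher predicted returns. The abstract does not spell out the training objective, so the pairwise (Bradley-Terry style) ranking loss, the linear reward over hand-made state features, the toy data, and every function name here are assumptions for illustration only. The deep reinforcement learning step that would optimize a policy against the learned reward is omitted.

```python
import math
import random


def feature_sum(trajectory, n_features):
    """Sum each state feature over the observed states of one trajectory."""
    return [sum(state[k] for state in trajectory) for k in range(n_features)]


def trajectory_return(weights, trajectory):
    """Predicted return: linear reward w . phi(s), summed over the trajectory."""
    phi = feature_sum(trajectory, len(weights))
    return sum(w * f for w, f in zip(weights, phi))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_ranking_loss(weights, worse, better):
    """Negative log-probability that `better` outranks `worse`."""
    diff = trajectory_return(weights, better) - trajectory_return(weights, worse)
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def fit_reward(ranked, n_features, lr=0.1, epochs=200, seed=0):
    """Gradient descent on the pairwise loss; `ranked` is ordered worst to best."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    pairs = [(i, j) for i in range(len(ranked)) for j in range(i + 1, len(ranked))]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            phi_worse = feature_sum(ranked[i], n_features)
            phi_better = feature_sum(ranked[j], n_features)
            diff = sum(w * (b - a) for w, a, b in zip(weights, phi_worse, phi_better))
            # d/dw of -log(sigmoid(diff)) is -(1 - sigmoid(diff)) * (phi_better - phi_worse)
            scale = 1.0 - sigmoid(diff)
            for k in range(n_features):
                weights[k] += lr * scale * (phi_better[k] - phi_worse[k])
    return weights


# Toy demonstrations (hypothetical): feature 0 is task progress, feature 1 is a constant.
ranked_demos = [
    [[0.0, 1.0], [0.1, 1.0], [0.1, 1.0]],  # worst
    [[0.2, 1.0], [0.4, 1.0], [0.5, 1.0]],
    [[0.5, 1.0], [0.8, 1.0], [1.0, 1.0]],  # best demonstration
]
weights = fit_reward(ranked_demos, n_features=2)

# A trajectory that makes more progress than any demonstration.
unseen = [[1.0, 1.0], [1.5, 1.0], [2.0, 1.0]]
print(trajectory_return(weights, unseen) > trajectory_return(weights, ranked_demos[-1]))
```

Because the fitted reward depends on the features that separate better demonstrations from worse ones, rather than on matching any single demonstration, a trajectory that goes further along those features can score above the best demonstration. That is the sense of extrapolation this toy example is meant to convey.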

Author Information

Daniel Brown (University of Texas at Austin)
Wonjoon Goo (University of Texas at Austin)
Prabhat Nagarajan (Preferred Networks)
Scott Niekum (University of Texas at Austin)
