Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel - Stanford University
Andrew Ng - Stanford University
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where performance is measured with respect to the expert's unknown reward function.
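The guarantee stated above rests on a simple observation about feature expectations. If the reward is linear in known features, R(s) = w . phi(s) with bounded weights, then a policy whose discounted feature expectations are close to the expert's must attain a nearby value under any such reward, by Cauchy-Schwarz. The sketch below illustrates this quantity on a toy MDP; the environment, feature map, and the two policies are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Sketch of the central object in apprenticeship learning: the discounted
# feature expectations mu(pi) = E[ sum_t gamma^t phi(s_t) ].
# If ||mu(expert) - mu(learner)|| <= eps, then for ANY reward
# R(s) = w . phi(s) with ||w||_2 <= 1, the performance gap
# |w . mu(expert) - w . mu(learner)| is at most eps (Cauchy-Schwarz).
# The toy MDP, policies, and feature map are illustrative assumptions.

rng = np.random.default_rng(0)
n_states, n_actions, k = 5, 2, 3
gamma, horizon = 0.9, 40
phi = rng.random((n_states, k))                 # known feature map phi(s)

# random transition kernel: P[s, a] is a distribution over next states
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

def feature_expectations(policy, n_rollouts=300):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)]."""
    mu = np.zeros(k)
    for _ in range(n_rollouts):
        s = 0
        for t in range(horizon):
            mu += (gamma ** t) * phi[s]
            s = rng.choice(n_states, p=P[s, policy(s)])
    return mu / n_rollouts

mu_expert = feature_expectations(lambda s: 0)       # stand-in "expert"
mu_learner = feature_expectations(lambda s: s % 2)  # stand-in learner

gap = np.linalg.norm(mu_expert - mu_learner)        # eps in the bound
w = rng.standard_normal(k)
w /= np.linalg.norm(w)                              # any unit-norm reward
perf_gap = abs(w @ (mu_expert - mu_learner))        # gap in attained value
assert perf_gap <= gap + 1e-9                       # Cauchy-Schwarz bound
```

This is why the algorithm can sidestep identifying the true reward: driving the feature-expectation gap down bounds the performance gap simultaneously for every reward in the admissible class.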