
Poster
Identifying the Reward Function by Anchor Actions
Sinong Geng · Houssam Nassif · Charlie Manzanares · Max Reppen · Ronnie Sircar

Wed Jul 15 08:00 AM -- 08:45 AM & Wed Jul 15 07:00 PM -- 07:45 PM (PDT)
We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. We name our method PQR, as it sequentially estimates the Policy, the $Q$-function, and the Reward function. PQR does not assume that the reward depends solely on the state; instead, it allows for a dependency on the choice of action. Moreover, PQR allows for stochastic state transitions. To accomplish this, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, yielding no reward. We present both estimators and algorithms for the PQR method. When the environment transition is known, we prove that the PQR reward estimator uniquely recovers the true reward. With unknown transitions, we bound the estimation error of PQR. Finally, the performance of PQR is demonstrated on synthetic and real-world datasets.
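The role of the anchor action can be illustrated in a simplified one-step (bandit-style) setting, which is a sketch of the identification idea rather than the paper's full algorithm. Under an energy-based (soft-max) policy, $\pi(a \mid s) \propto \exp Q(s,a)$, so observing the policy identifies $Q$ only up to a state-dependent shift $V(s)$. Pinning the reward of one anchor action (here action 0, with zero reward) removes that ambiguity; all names below are illustrative:

```python
import numpy as np

# One-step illustration: with no future term, Q(s, a) = r(s, a), and the
# soft-max policy is pi(a | s) ∝ exp(r(s, a)).

rng = np.random.default_rng(0)
n_states, n_actions = 5, 4

# True rewards; column 0 is the anchor ("do nothing") action with zero reward.
r_true = rng.normal(size=(n_states, n_actions))
r_true[:, 0] = 0.0

# Soft-max policy induced by the rewards (numerically stabilized).
logits = r_true
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

# log pi(a | s) = r(s, a) - V(s); subtracting the anchor column cancels the
# unidentified state-dependent term V(s), recovering the reward exactly.
log_pi = np.log(pi)
r_hat = log_pi - log_pi[:, [0]]

assert np.allclose(r_hat, r_true)
```

In the full sequential setting the same subtraction pins down the state-dependent shift via the anchor action's known reward, with an additional Bellman correction for discounted future values.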

#### Author Information

##### Charlie Manzanares (Amazon)

I manage the Amazon Prime Economics team, which comprises a worldwide team of economists (US, EU, JP, and India), applied scientists, data and software engineers, and product managers. We develop customized econometric and machine learning demand and pricing models for Amazon Prime, which includes Prime Video, Prime Music, and free or discounted shipping on millions of goods, among other benefits. We use these tools to study the counterfactual effects of proposed price changes, benefit bundling strategies, new benefits, and alternative membership structures on Prime demand in Prime's growing number of international marketplaces. As a complement to our internally facing work, our team has produced scientific publications proposing new methods for scalably solving challenging dynamic inverse reinforcement learning problems.