Reinforcement Learning Can Be More Efficient with Multiple Rewards
Christoph Dann · Yishay Mansour · Mehryar Mohri

Wed Jul 26 05:00 PM -- 06:30 PM (PDT) @ Exhibit Hall 1 #430

Reward design is one of the most critical and challenging aspects of formulating a task as a reinforcement learning (RL) problem. In practice, it often takes several rounds of specifying a reward and learning with it to find one that leads to sample-efficient learning of the desired behavior. Instead, in this work, we study whether directly incorporating multiple alternate reward formulations of the same task in a single agent can lead to faster learning. We analyze multi-reward extensions of action-elimination algorithms and prove more favorable instance-dependent regret bounds compared to their single-reward counterparts, both in multi-armed bandits and in tabular Markov decision processes. For each state-action pair, our bounds scale with the inverse of the largest gap among all reward functions. This suggests that learning with multiple rewards can indeed be more sample-efficient, as long as the rewards agree on an optimal policy. We further prove that when rewards do not agree, multi-reward action elimination in multi-armed bandits still learns a policy that is good across all reward functions.
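
To illustrate the idea for the multi-armed bandit case, here is a minimal sketch of action elimination with several reward signals. It is not the authors' algorithm or analysis. It assumes every pull returns one value in [0, 1] per reward function, uses a generic Hoeffding-style confidence radius, and assumes all reward functions agree on the best arm. The names multi_reward_elimination, pull, rounds and delta are hypothetical. An arm is dropped as soon as any single reward function shows it to be suboptimal, so the largest gap across reward functions drives its elimination.

```python
import math
import random


def multi_reward_elimination(pull, num_arms, num_rewards, rounds, delta=0.05):
    """Round-robin action elimination with several reward signals per pull.

    pull(arm) must return a sequence of num_rewards values in [0, 1].
    Returns the set of arms still active after at most `rounds` sweeps.
    Assumption: all reward functions share the same optimal arm.
    """
    active = set(range(num_arms))
    counts = [0] * num_arms
    sums = [[0.0] * num_rewards for _ in range(num_arms)]

    def radius(arm):
        # Hoeffding-style confidence radius (an illustrative choice).
        n = counts[arm]
        log_term = math.log(4 * num_arms * num_rewards * n * n / delta)
        return math.sqrt(log_term / (2 * n))

    for _ in range(rounds):
        if len(active) == 1:
            break
        # Pull every active arm once and record all reward signals.
        for arm in sorted(active):
            rewards = pull(arm)
            counts[arm] += 1
            for i in range(num_rewards):
                sums[arm][i] += rewards[i]

        # Eliminate an arm if ANY reward function certifies it as suboptimal.
        eliminated = set()
        for arm in active:
            for i in range(num_rewards):
                best_lcb = max(sums[b][i] / counts[b] - radius(b) for b in active)
                ucb = sums[arm][i] / counts[arm] + radius(arm)
                if ucb < best_lcb:
                    eliminated.add(arm)
                    break
        active -= eliminated
    return active


if __name__ == "__main__":
    # Hypothetical instance: arm 0 is optimal under both reward functions;
    # arm 1 has a large gap only under reward 1, arm 2 only under reward 0.
    means = [[0.9, 0.9], [0.85, 0.4], [0.4, 0.85]]
    rng = random.Random(0)

    def bernoulli_pull(arm):
        return [1.0 if rng.random() < p else 0.0 for p in means[arm]]

    print(multi_reward_elimination(bernoulli_pull, 3, 2, rounds=2000))
```

In this instance each suboptimal arm is nearly indistinguishable from the optimal arm under one reward function but clearly worse under the other, so using both signals lets the sketch discard both arms with far fewer pulls than either signal alone would need.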

Author Information

Christoph Dann (Google)
Yishay Mansour (Google and Tel Aviv University)
Mehryar Mohri (Google Research and Courant Institute of Mathematical Sciences)