Multi-Task Off-Policy Learning from Bandit Feedback
Joey Hong · Branislav Kveton · Manzil Zaheer · Sumeet Katariya · Mohammad Ghavamzadeh

Wed Jul 26 05:00 PM -- 06:30 PM (PDT) @ Exhibit Hall 1 #532

Many practical problems involve solving similar tasks. In recommender systems, the tasks can be users with similar preferences; in search engines, the tasks can be items with similar affinities. To learn statistically efficiently, the tasks can be organized in a hierarchy, where the task affinity is captured by an unknown latent parameter. We study the problem of off-policy learning for similar tasks from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm, HierOPO. The key idea is to estimate the task parameters using the hierarchy and then act pessimistically with respect to them. To analyze the algorithm, we develop novel Bayesian error bounds. Our bounds are the first in off-policy learning that improve with a more informative prior and capture statistical gains due to hierarchical models. Therefore, they are of general interest. HierOPO also performs well in practice. Our experiments demonstrate the benefits of using the hierarchy over solving each task independently.
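The key idea in the abstract, estimating task parameters through a shared hierarchy and then acting pessimistically with respect to them, can be illustrated with a small sketch. This is not the paper's HierOPO algorithm or its experimental setup; the Gaussian hierarchical model, the uniform logging policy, and all variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (not from the paper): each task is a K-armed bandit
# whose mean rewards are drawn around a shared latent parameter mu_star.
K, num_tasks, n_logged = 5, 20, 30
mu_star = rng.normal(0.0, 1.0, K)            # unknown latent parameter
sigma_0, sigma = 0.5, 1.0                    # task-level and noise std devs
theta = mu_star + rng.normal(0.0, sigma_0, (num_tasks, K))  # task params

# Logged bandit feedback: a uniform logging policy records (arm, reward).
counts = np.zeros((num_tasks, K))
sums = np.zeros((num_tasks, K))
for s in range(num_tasks):
    for _ in range(n_logged):
        a = rng.integers(K)
        counts[s, a] += 1
        sums[s, a] += theta[s, a] + rng.normal(0.0, sigma)

# Step 1: estimate the latent parameter by pooling feedback across tasks,
# which is where the statistical gain from the hierarchy comes from.
mu_hat = sums.sum(axis=0) / np.maximum(counts.sum(axis=0), 1)

# Step 2: per-task Gaussian posterior under a N(mu_hat, sigma_0^2) prior,
# then act pessimistically via a lower confidence bound (LCB).
post_var = 1.0 / (1.0 / sigma_0**2 + counts / sigma**2)
post_mean = post_var * (mu_hat / sigma_0**2 + sums / sigma**2)
pessimistic = post_mean - np.sqrt(post_var)  # LCB with unit width
policy = pessimistic.argmax(axis=1)          # one chosen arm per task

print(policy.shape)  # one action per task: (num_tasks,)
```

Tasks with little logged data are shrunk toward the pooled estimate `mu_hat`, while the LCB penalizes arms whose posterior is still uncertain, which is the pessimism the abstract refers to.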

Author Information

Joey Hong (Berkeley)
Branislav Kveton (AWS AI Labs)
Manzil Zaheer (Google DeepMind)
Sumeet Katariya (Amazon)
Mohammad Ghavamzadeh (Google Research)