Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historical log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications, from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
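To make the idea concrete, here is a minimal self-contained sketch (not the authors' released code) contrasting vanilla inverse propensity scoring (IPS) with a marginalized variant that weights by the embedding distribution instead of the action distribution. The context-free policies `pi` and `pi_0`, the categorical embedding model `p_e_given_a`, the reward model, and all sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, n_embeds = 10_000, 1_000, 10

# Assumed embedding distribution p(e|a): each action concentrates
# on a few of the categorical embedding values.
p_e_given_a = rng.dirichlet(np.ones(n_embeds) * 0.1, size=n_actions)

# Context-free logging policy pi_0 and target policy pi (a
# simplification for this sketch; the paper conditions on contexts x).
pi_0 = rng.dirichlet(np.ones(n_actions))
pi = rng.dirichlet(np.ones(n_actions))

# Simulate logs: a ~ pi_0, e ~ p(e|a), and a reward that depends on
# the action only through its embedding (the condition under which
# the marginalized estimator is unbiased).
a = rng.choice(n_actions, size=n, p=pi_0)
e = np.array([rng.choice(n_embeds, p=p_e_given_a[ai]) for ai in a])
r = rng.normal(loc=e / n_embeds, scale=0.5)

# Vanilla IPS: action-level weights pi(a)/pi_0(a) become heavy-tailed
# as the number of actions grows, inflating variance.
v_ips = np.mean(pi[a] / pi_0[a] * r)

# Marginalized weights: p(e|pi) / p(e|pi_0), obtained by summing
# pi(a) * p(e|a) over actions; few embedding values => bounded weights.
p_e_pi = pi @ p_e_given_a      # shape: (n_embeds,)
p_e_pi0 = pi_0 @ p_e_given_a
v_mips = np.mean(p_e_pi[e] / p_e_pi0[e] * r)

# Ground-truth policy value under the synthetic reward model above.
v_true = pi @ (p_e_given_a @ (np.arange(n_embeds) / n_embeds))
print(f"true={v_true:.4f}  IPS={v_ips:.4f}  MIPS={v_mips:.4f}")
```

With 1,000 actions but only 10 embedding values, the marginalized weights stay bounded while the IPS weights explode, which is the bias-variance benefit the abstract describes; the full method additionally handles context-dependent policies and stochastic embeddings.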
Author Information
Yuta Saito (Cornell University)
Thorsten Joachims (Cornell)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Spotlight: Off-Policy Evaluation for Large Action Spaces via Embeddings
  Thu. Jul 21st, 08:10 -- 08:15 PM, Room 309
More from the Same Authors
- 2023 Poster: Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling
  Yuta Saito · Qingyang Ren · Thorsten Joachims
- 2022: Learning from Preference Feedback in Combinatorial Action Spaces
  Thorsten Joachims
- 2022 Poster: Improving Screening Processes via Calibrated Subset Selection
  Luke Lequn Wang · Thorsten Joachims · Manuel Gomez-Rodriguez
- 2022 Spotlight: Improving Screening Processes via Calibrated Subset Selection
  Luke Lequn Wang · Thorsten Joachims · Manuel Gomez-Rodriguez
- 2021 Poster: Fairness of Exposure in Stochastic Bandits
  Luke Lequn Wang · Yiwei Bai · Wen Sun · Thorsten Joachims
- 2021 Spotlight: Fairness of Exposure in Stochastic Bandits
  Luke Lequn Wang · Yiwei Bai · Wen Sun · Thorsten Joachims
- 2021 Poster: Optimal Off-Policy Evaluation from Multiple Logging Policies
  Nathan Kallus · Yuta Saito · Masatoshi Uehara
- 2021 Spotlight: Optimal Off-Policy Evaluation from Multiple Logging Policies
  Nathan Kallus · Yuta Saito · Masatoshi Uehara
- 2019 Poster: CAB: Continuous Adaptive Blending for Policy Evaluation and Learning
  Yi Su · Luke Lequn Wang · Michele Santacatterina · Thorsten Joachims
- 2019 Oral: CAB: Continuous Adaptive Blending for Policy Evaluation and Learning
  Yi Su · Luke Lequn Wang · Michele Santacatterina · Thorsten Joachims