

2019 Invited Talk in Workshop: Real-world Sequential Decision Making: Reinforcement Learning and Beyond

Miro Dudík (Microsoft Research) - Doubly Robust Off-policy Evaluation with Shrinkage

Miroslav Dudík

Abstract:

The contextual bandit is a learning protocol that encompasses applications such as news recommendation, advertising, and mobile health, where an algorithm repeatedly observes some information about a user, decides what content to present, and accrues a reward if the presented content is successful. In this talk, I will focus on the fundamental task of evaluating a new policy given historical data. I will describe the asymptotically optimal approach of doubly robust (DR) estimation, the reasons for its shortcomings in finite samples, and how to overcome these shortcomings by directly optimizing a bound on the finite-sample error. The optimization yields a new family of estimators that, like DR, leverage any direct model of rewards, but shrink the importance weights to obtain a better bias-variance tradeoff than DR. The error bounds can also be used to select the best among multiple reward predictors. Somewhat surprisingly, the reward predictors that work best with standard DR are not the same as those that work best with our modified DR. Our new estimator and model selection procedure perform extremely well across a wide variety of settings, so we expect they will enjoy broad practical use.
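For concreteness, below is a minimal Python sketch of a doubly robust value estimate with shrunk importance weights. It uses simple weight clipping as one illustrative form of shrinkage; the function name, data layout, and the clipping rule are assumptions for illustration, not the talk's exact estimator, which chooses the shrinkage by optimizing a finite-sample error bound (omitted here).

```python
import numpy as np

def dr_shrinkage_estimate(rewards, weights, r_hat_logged, r_hat_target, lam=np.inf):
    """Doubly robust off-policy value estimate with shrunk importance weights.

    rewards      : observed rewards r_i for the logged actions
    weights      : importance weights pi(a_i | x_i) / mu(a_i | x_i)
    r_hat_logged : reward model's prediction at the logged action, r_hat(x_i, a_i)
    r_hat_target : reward model's prediction averaged over the target policy,
                   sum_a pi(a | x_i) * r_hat(x_i, a)
    lam          : shrinkage parameter; lam = np.inf recovers standard DR
    """
    # Clipping is one simple instance of shrinkage: it caps large weights,
    # trading a small bias for a large variance reduction.
    w_shrunk = np.minimum(weights, lam)
    return np.mean(r_hat_target + w_shrunk * (rewards - r_hat_logged))

# Stand-in synthetic data, purely to show the call signature.
rng = np.random.default_rng(0)
n = 1000
w = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # heavy-tailed importance weights
r = rng.binomial(1, 0.3, size=n).astype(float)   # binary rewards
r_hat_a = np.full(n, 0.3)                        # reward model at logged actions
r_hat_pi = np.full(n, 0.3)                       # reward model under target policy
print(dr_shrinkage_estimate(r, w, r_hat_a, r_hat_pi, lam=5.0))
```

With `lam = np.inf` the estimate reduces to standard DR; smaller values of `lam` shrink the weights more aggressively, which is where the bias-variance tradeoff discussed in the abstract comes in.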

Based on joint work with Yi Su, Maria Dimakopoulou, and Akshay Krishnamurthy.
