
On the Design of Estimators for Bandit Off-Policy Evaluation
Nikos Vlassis · Aurelien Bibaut · Maria Dimakopoulou · Tony Jebara

Wed Jun 12 12:10 PM -- 12:15 PM (PDT) @ Seaside Ballroom

Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that yields an estimator shown to perform well on multi-class cost-sensitive classification datasets.
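The abstract does not spell out the control-variate class, so the following is only a minimal sketch of the general idea in the multi-armed bandit setting. It assumes inverse propensity scoring (IPS) as the base estimator and uses the single control variate w - 1, where w = pi(a)/mu(a) is the importance weight; this has zero mean under the logging policy mu whenever mu covers every arm. The coefficient beta is a plug-in estimate of the variance-minimizing value. The functions `ips_with_control_variate` and `simulate`, the three-arm policies, and the Bernoulli reward means are hypothetical choices for illustration, not the authors' design.

```python
import numpy as np


def ips_with_control_variate(actions, rewards, target_probs, behavior_probs):
    """Return (plain IPS estimate, control-variate estimate, beta).

    Sketch only: uses the generic control variate (w - 1), whose mean is
    zero under the behavior policy, with a plug-in variance-minimizing
    coefficient. This is not claimed to be the estimator from the paper.
    """
    # Importance weights w_i = pi(a_i) / mu(a_i) for each logged action.
    w = target_probs[actions] / behavior_probs[actions]
    y = w * rewards   # per-sample IPS terms
    x = w - 1.0       # control variate with E_mu[x] = 0
    ips = y.mean()
    # Plug-in coefficient beta = Cov(y, x) / Var(x), estimated from the same sample.
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.dot(xc, xc)
    beta = float(np.dot(xc, yc) / denom) if denom > 0 else 0.0
    cv = ips - beta * x.mean()
    return float(ips), float(cv), beta


def simulate(n, seed=0):
    """Draw n logged rounds from a hypothetical three-arm Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    behavior = np.array([0.5, 0.3, 0.2])  # hypothetical logging policy mu
    target = np.array([0.1, 0.3, 0.6])    # hypothetical target policy pi
    means = np.array([0.2, 0.5, 0.8])     # hypothetical reward means per arm
    actions = rng.choice(len(behavior), size=n, p=behavior)
    rewards = rng.binomial(1, means[actions]).astype(float)
    true_value = float(target @ means)    # value of pi under the true means
    return actions, rewards, target, behavior, true_value


if __name__ == "__main__":
    a, r, pi, mu, v = simulate(100_000)
    ips, cv, beta = ips_with_control_variate(a, r, pi, mu)
    print(f"true={v:.3f} ips={ips:.3f} cv={cv:.3f} beta={beta:.3f}")
```

Choosing beta as the sample ratio Cov(y, x)/Var(x) minimizes the empirical variance of the corrected terms, but estimating it from the same data introduces a small finite-sample bias. When a correction of this kind is guaranteed to reduce risk, as opposed to merely variance, is the question the paper's analysis addresses.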

Author Information

Nikos Vlassis (Netflix)
Aurelien Bibaut (UC Berkeley)
Maria Dimakopoulou (Netflix)
Tony Jebara (Netflix)
